Message ID: 1594735034-19190-1-git-send-email-laoar.shao@gmail.com
State: New, archived
Series: [v2] memcg, oom: check memcg margin for parallel oom
On Tue 14-07-20 09:57:14, Yafang Shao wrote: > Memcg oom killer invocation is synchronized by the global oom_lock and > tasks are sleeping on the lock while somebody is selecting the victim or > potentially race with the oom_reaper is releasing the victim's memory. > This can result in a pointless oom killer invocation because a waiter > might be racing with the oom_reaper > > P1 oom_reaper P2 > oom_reap_task mutex_lock(oom_lock) > out_of_memory # no victim because we have one already > __oom_reap_task_mm mute_unlock(oom_lock) > mutex_lock(oom_lock) > set MMF_OOM_SKIP > select_bad_process > # finds a new victim > > The page allocator prevents from this race by trying to allocate after > the lock can be acquired (in __alloc_pages_may_oom) which acts as a last > minute check. Moreover page allocator simply doesn't block on the > oom_lock and simply retries the whole reclaim process. > > Memcg oom killer should do the last minute check as well. Call > mem_cgroup_margin to do that. Trylock on the oom_lock could be done as > well but this doesn't seem to be necessary at this stage. > > [mhocko@kernel.org: commit log] > Suggested-by: Michal Hocko <mhocko@kernel.org> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> > Cc: David Rientjes <rientjes@google.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Thanks! > --- > v1 -> v2: > - commit log improved by Michal > - retitle the subject from "mm, oom: check memcg margin for parallel oom" > - code simplicity, per Michal > --- > mm/memcontrol.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 1962232..15e0e18 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > .gfp_mask = gfp_mask, > .order = order, > }; > - bool ret; > + bool ret = true; > > if (mutex_lock_killable(&oom_lock)) > return true; > + > + if (mem_cgroup_margin(memcg) >= (1 << order)) > + goto unlock; > + > /* > * A few threads which were not waiting at mutex_lock_killable() can > * fail to bail out. Therefore, check again after holding oom_lock. > */ > ret = should_force_charge() || out_of_memory(&oc); > + > +unlock: > mutex_unlock(&oom_lock); > return ret; > } > -- > 1.8.3.1
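[Editor's note: for reference, the "last minute check" in the page allocator that the commit message points to looks roughly like the sketch below. This is a simplified paraphrase of the existing code in __alloc_pages_may_oom() (mm/page_alloc.c), not new code proposed in this thread, and the exact gfp/alloc flags may differ between kernel versions. Once oom_lock is acquired, one more allocation attempt is made against a very high watermark, so memory freed by a parallel oom kill is noticed before a new victim is selected.]

	/* Simplified paraphrase of __alloc_pages_may_oom(); details may vary. */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * Go through the zonelist one more time with a very high watermark:
	 * this only catches a parallel oom kill freeing memory.  Unlike
	 * mem_cgroup_margin(), which just compares page counts, this has to
	 * find physically contiguous free pages of the requested order.
	 */
	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
				      ~__GFP_DIRECT_RECLAIM, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page)
		goto out;	/* memory appeared; do not invoke the OOM killer */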
Yafang Shao writes: >Memcg oom killer invocation is synchronized by the global oom_lock and >tasks are sleeping on the lock while somebody is selecting the victim or >potentially race with the oom_reaper is releasing the victim's memory. >This can result in a pointless oom killer invocation because a waiter >might be racing with the oom_reaper > > P1 oom_reaper P2 > oom_reap_task mutex_lock(oom_lock) > out_of_memory # no victim because we have one already > __oom_reap_task_mm mute_unlock(oom_lock) > mutex_lock(oom_lock) > set MMF_OOM_SKIP > select_bad_process > # finds a new victim > >The page allocator prevents from this race by trying to allocate after >the lock can be acquired (in __alloc_pages_may_oom) which acts as a last >minute check. Moreover page allocator simply doesn't block on the >oom_lock and simply retries the whole reclaim process. > >Memcg oom killer should do the last minute check as well. Call >mem_cgroup_margin to do that. Trylock on the oom_lock could be done as >well but this doesn't seem to be necessary at this stage. > >[mhocko@kernel.org: commit log] >Suggested-by: Michal Hocko <mhocko@kernel.org> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com> >Cc: Michal Hocko <mhocko@kernel.org> >Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> >Cc: David Rientjes <rientjes@google.com> >Cc: Johannes Weiner <hannes@cmpxchg.org> Good catch, thanks. Acked-by: Chris Down <chris@chrisdown.name>
On Tue, 14 Jul 2020, Yafang Shao wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 1962232..15e0e18 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > .gfp_mask = gfp_mask, > .order = order, > }; > - bool ret; > + bool ret = true; > > if (mutex_lock_killable(&oom_lock)) > return true; > + > + if (mem_cgroup_margin(memcg) >= (1 << order)) > + goto unlock; > + > /* > * A few threads which were not waiting at mutex_lock_killable() can > * fail to bail out. Therefore, check again after holding oom_lock. > */ > ret = should_force_charge() || out_of_memory(&oc); > + > +unlock: > mutex_unlock(&oom_lock); > return ret; > } Hi Yafang, We've run with a patch very much like this for several years and it works quite successfully to prevent the unnecessary oom killing of processes. We do this in out_of_memory() directly, however, because we found that we could prevent even *more* unnecessary killing if we checked this at the "point of no return" because the selection of processes takes some additional time when we might resolve the oom condition. Some may argue that this is unnecessarily exposing mem_cgroup_margin() to generic mm code, but in the interest of preventing any unnecessary oom kill we've found it to be helpful. I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089. diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void mem_cgroup_split_huge_fixup(struct page *head); #endif +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg); + #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT 0 @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, { } +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) +{ +} + static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, bool in_low_reclaim) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, * Returns the maximum amount of memory @mem can be charged with, in * pages. */ -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) { unsigned long margin = 0; unsigned long count; diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc) if (!is_sysrq_oom(oc) && !is_memcg_oom(oc)) panic("System is deadlocked on memory\n"); } - if (oc->chosen && oc->chosen != (void *)-1UL) + if (oc->chosen && oc->chosen != (void *)-1UL) { + if (is_memcg_oom(oc)) { + /* + * If a memcg is now under its limit or current will be + * exiting and freeing memory, avoid needlessly killing + * chosen. + */ + if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) || + task_will_free_mem(current)) { + put_task_struct(oc->chosen); + return true; + } + } + oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : "Memory cgroup out of memory"); + } return !!oc->chosen; }
On Wed, Jul 15, 2020 at 2:46 AM David Rientjes <rientjes@google.com> wrote: > > On Tue, 14 Jul 2020, Yafang Shao wrote: > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 1962232..15e0e18 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > > .gfp_mask = gfp_mask, > > .order = order, > > }; > > - bool ret; > > + bool ret = true; > > > > if (mutex_lock_killable(&oom_lock)) > > return true; > > + > > + if (mem_cgroup_margin(memcg) >= (1 << order)) > > + goto unlock; > > + > > /* > > * A few threads which were not waiting at mutex_lock_killable() can > > * fail to bail out. Therefore, check again after holding oom_lock. > > */ > > ret = should_force_charge() || out_of_memory(&oc); > > + > > +unlock: > > mutex_unlock(&oom_lock); > > return ret; > > } > > Hi Yafang, > > We've run with a patch very much like this for several years and it works > quite successfully to prevent the unnecessary oom killing of processes. > > We do this in out_of_memory() directly, however, because we found that we > could prevent even *more* unnecessary killing if we checked this at the > "point of no return" because the selection of processes takes some > additional time when we might resolve the oom condition. > Hi David, Your proposal could also resolve the issue, but I'm wondering why do it specifically for memcg oom? Doesn't it apply to global oom? For example, in the global oom, when selecting the processes, the others might free some pages and then it might allocate pages successfully. > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to > generic mm code, but in the interest of preventing any unnecessary oom > kill we've found it to be helpful. > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089. > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, > void mem_cgroup_split_huge_fixup(struct page *head); > #endif > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg); > + > #else /* CONFIG_MEMCG */ > > #define MEM_CGROUP_ID_SHIFT 0 > @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, > { > } > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > +{ > +} > + > static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, > bool in_low_reclaim) > { > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, > * Returns the maximum amount of memory @mem can be charged with, in > * pages. 
> */ > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > { > unsigned long margin = 0; > unsigned long count; > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc) > if (!is_sysrq_oom(oc) && !is_memcg_oom(oc)) > panic("System is deadlocked on memory\n"); > } > - if (oc->chosen && oc->chosen != (void *)-1UL) > + if (oc->chosen && oc->chosen != (void *)-1UL) { > + if (is_memcg_oom(oc)) { > + /* > + * If a memcg is now under its limit or current will be > + * exiting and freeing memory, avoid needlessly killing > + * chosen. > + */ > + if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) || > + task_will_free_mem(current)) { > + put_task_struct(oc->chosen); > + return true; > + } > + } > + > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > "Memory cgroup out of memory"); > + } > return !!oc->chosen; > } >
On Wed, 15 Jul 2020, Yafang Shao wrote: > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 1962232..15e0e18 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > > > .gfp_mask = gfp_mask, > > > .order = order, > > > }; > > > - bool ret; > > > + bool ret = true; > > > > > > if (mutex_lock_killable(&oom_lock)) > > > return true; > > > + > > > + if (mem_cgroup_margin(memcg) >= (1 << order)) > > > + goto unlock; > > > + > > > /* > > > * A few threads which were not waiting at mutex_lock_killable() can > > > * fail to bail out. Therefore, check again after holding oom_lock. > > > */ > > > ret = should_force_charge() || out_of_memory(&oc); > > > + > > > +unlock: > > > mutex_unlock(&oom_lock); > > > return ret; > > > } > > > > Hi Yafang, > > > > We've run with a patch very much like this for several years and it works > > quite successfully to prevent the unnecessary oom killing of processes. > > > > We do this in out_of_memory() directly, however, because we found that we > > could prevent even *more* unnecessary killing if we checked this at the > > "point of no return" because the selection of processes takes some > > additional time when we might resolve the oom condition. > > > > Hi David, > > Your proposal could also resolve the issue, It has successfully resolved it for several years in our kernel, we tried an approach similiar to yours but saw many instances where memcg oom kills continued to proceed even though the memcg information dumped to the kernel log showed memory available. If this was a page or two that became available due to memory freeing, it's not a significant difference. Instead, if this races with an oom notification and a process exiting or being SIGKILL'd, it becomes much harder to explain to a user why their process was oom killed when there are tens of megabytes of memory available as shown by the kernel log (the freeing/exiting happened during a particularly long iteration of processes attached to the memcg, for example). That's what motivated a change to moving this to out_of_memory() directly, we found that it prevented even more unnecessary oom kills, which is a very good thing. It may only be easily observable and make a significant difference at very large scale, however. > but I'm wondering why do > it specifically for memcg oom? > Doesn't it apply to global oom? > For example, in the global oom, when selecting the processes, the > others might free some pages and then it might allocate pages > successfully. > It's more complex because memory being allocated from the page allocator must be physically contiguous, it's not a simple matter of comparing the margin of available memory to the memory being charged. It could theoretically be done but I haven't seen a use case where it has actually mattered as opposed to memcg oom where it can happen quite readily at scale. When memory is uncharged to a memcg because of large freeing or a process exiting, that's immediately chargable by another process in the same hierarchy because of its isolation as opposed to the page allocator where that memory is up for grabs and anything on the system could allocate it. > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to > > generic mm code, but in the interest of preventing any unnecessary oom > > kill we've found it to be helpful. > > > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089. 
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, > > void mem_cgroup_split_huge_fixup(struct page *head); > > #endif > > > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg); > > + > > #else /* CONFIG_MEMCG */ > > > > #define MEM_CGROUP_ID_SHIFT 0 > > @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, > > { > > } > > > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > +{ > > +} > > + > > static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, > > bool in_low_reclaim) > > { > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, > > * Returns the maximum amount of memory @mem can be charged with, in > > * pages. > > */ > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > { > > unsigned long margin = 0; > > unsigned long count; > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc) > > if (!is_sysrq_oom(oc) && !is_memcg_oom(oc)) > > panic("System is deadlocked on memory\n"); > > } > > - if (oc->chosen && oc->chosen != (void *)-1UL) > > + if (oc->chosen && oc->chosen != (void *)-1UL) { > > + if (is_memcg_oom(oc)) { > > + /* > > + * If a memcg is now under its limit or current will be > > + * exiting and freeing memory, avoid needlessly killing > > + * chosen. > > + */ > > + if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) || > > + task_will_free_mem(current)) { > > + put_task_struct(oc->chosen); > > + return true; > > + } > > + } > > + > > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > > "Memory cgroup out of memory"); > > + } > > return !!oc->chosen; > > } > > > > > -- > Thanks > Yafang >
On Wed, Jul 15, 2020 at 10:44 AM David Rientjes <rientjes@google.com> wrote: > > On Wed, 15 Jul 2020, Yafang Shao wrote: > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > index 1962232..15e0e18 100644 > > > > --- a/mm/memcontrol.c > > > > +++ b/mm/memcontrol.c > > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > > > > .gfp_mask = gfp_mask, > > > > .order = order, > > > > }; > > > > - bool ret; > > > > + bool ret = true; > > > > > > > > if (mutex_lock_killable(&oom_lock)) > > > > return true; > > > > + > > > > + if (mem_cgroup_margin(memcg) >= (1 << order)) > > > > + goto unlock; > > > > + > > > > /* > > > > * A few threads which were not waiting at mutex_lock_killable() can > > > > * fail to bail out. Therefore, check again after holding oom_lock. > > > > */ > > > > ret = should_force_charge() || out_of_memory(&oc); > > > > + > > > > +unlock: > > > > mutex_unlock(&oom_lock); > > > > return ret; > > > > } > > > > > > Hi Yafang, > > > > > > We've run with a patch very much like this for several years and it works > > > quite successfully to prevent the unnecessary oom killing of processes. > > > > > > We do this in out_of_memory() directly, however, because we found that we > > > could prevent even *more* unnecessary killing if we checked this at the > > > "point of no return" because the selection of processes takes some > > > additional time when we might resolve the oom condition. > > > > > > > Hi David, > > > > Your proposal could also resolve the issue, > > It has successfully resolved it for several years in our kernel, we tried > an approach similiar to yours but saw many instances where memcg oom kills > continued to proceed even though the memcg information dumped to the > kernel log showed memory available. > > If this was a page or two that became available due to memory freeing, > it's not a significant difference. Instead, if this races with an oom > notification and a process exiting or being SIGKILL'd, it becomes much > harder to explain to a user why their process was oom killed when there > are tens of megabytes of memory available as shown by the kernel log (the > freeing/exiting happened during a particularly long iteration of processes > attached to the memcg, for example). > > That's what motivated a change to moving this to out_of_memory() directly, > we found that it prevented even more unnecessary oom kills, which is a > very good thing. It may only be easily observable and make a significant > difference at very large scale, however. > Thanks for the clarification. If it is the race which causes this issue and we want to reduce the race window, I don't know whether it is proper to check the memcg margin in out_of_memory() or do it before calling do_send_sig_info(). Because per my understanding, dump_header() always takes much more time than select_bad_process() especially if there're slow consoles. So the race might easily happen when doing dump_header() or dumping other information, but if we check the memcg margin after dumping this oom info, it would be strange to dump so much oom logs without killing a process. > > but I'm wondering why do > > it specifically for memcg oom? > > Doesn't it apply to global oom? > > For example, in the global oom, when selecting the processes, the > > others might free some pages and then it might allocate pages > > successfully. 
> > > > It's more complex because memory being allocated from the page allocator > must be physically contiguous, it's not a simple matter of comparing the > margin of available memory to the memory being charged. It could > theoretically be done but I haven't seen a use case where it has actually > mattered as opposed to memcg oom where it can happen quite readily at > scale. When memory is uncharged to a memcg because of large freeing or a > process exiting, that's immediately chargable by another process in the > same hierarchy because of its isolation as opposed to the page allocator > where that memory is up for grabs and anything on the system could > allocate it. > > > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to > > > generic mm code, but in the interest of preventing any unnecessary oom > > > kill we've found it to be helpful. > > > > > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089. > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, > > > void mem_cgroup_split_huge_fixup(struct page *head); > > > #endif > > > > > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg); > > > + > > > #else /* CONFIG_MEMCG */ > > > > > > #define MEM_CGROUP_ID_SHIFT 0 > > > @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, > > > { > > > } > > > > > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > > +{ > > > +} > > > + > > > static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, > > > bool in_low_reclaim) > > > { > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, > > > * Returns the maximum amount of memory @mem can be charged with, in > > > * pages. > > > */ > > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) > > > { > > > unsigned long margin = 0; > > > unsigned long count; > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc) > > > if (!is_sysrq_oom(oc) && !is_memcg_oom(oc)) > > > panic("System is deadlocked on memory\n"); > > > } > > > - if (oc->chosen && oc->chosen != (void *)-1UL) > > > + if (oc->chosen && oc->chosen != (void *)-1UL) { > > > + if (is_memcg_oom(oc)) { > > > + /* > > > + * If a memcg is now under its limit or current will be > > > + * exiting and freeing memory, avoid needlessly killing > > > + * chosen. > > > + */ > > > + if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) || > > > + task_will_free_mem(current)) { > > > + put_task_struct(oc->chosen); > > > + return true; > > > + } > > > + } > > > + > > > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > > > "Memory cgroup out of memory"); > > > + } > > > return !!oc->chosen; > > > } > > > > > > > > > -- > > Thanks > > Yafang > >
On Wed, 15 Jul 2020, Yafang Shao wrote: > > It has successfully resolved it for several years in our kernel, we tried > > an approach similiar to yours but saw many instances where memcg oom kills > > continued to proceed even though the memcg information dumped to the > > kernel log showed memory available. > > > > If this was a page or two that became available due to memory freeing, > > it's not a significant difference. Instead, if this races with an oom > > notification and a process exiting or being SIGKILL'd, it becomes much > > harder to explain to a user why their process was oom killed when there > > are tens of megabytes of memory available as shown by the kernel log (the > > freeing/exiting happened during a particularly long iteration of processes > > attached to the memcg, for example). > > > > That's what motivated a change to moving this to out_of_memory() directly, > > we found that it prevented even more unnecessary oom kills, which is a > > very good thing. It may only be easily observable and make a significant > > difference at very large scale, however. > > > > Thanks for the clarification. > > If it is the race which causes this issue and we want to reduce the > race window, I don't know whether it is proper to check the memcg > margin in out_of_memory() or do it before calling do_send_sig_info(). > Because per my understanding, dump_header() always takes much more > time than select_bad_process() especially if there're slow consoles. > So the race might easily happen when doing dump_header() or dumping > other information, but if we check the memcg margin after dumping this > oom info, it would be strange to dump so much oom logs without killing > a process. > Absolutely correct :) In my proposed patch, we declare dump_header() as the "point of no return" since we don't want to dump oom kill information to the kernel log when nothing is actually killed. We could abort at the very last minute, as you mention, but I think that may have an adverse impact on anything that cares about that log message.
On Wed, Jul 15, 2020 at 11:18 AM David Rientjes <rientjes@google.com> wrote: > > On Wed, 15 Jul 2020, Yafang Shao wrote: > > > > It has successfully resolved it for several years in our kernel, we tried > > > an approach similiar to yours but saw many instances where memcg oom kills > > > continued to proceed even though the memcg information dumped to the > > > kernel log showed memory available. > > > > > > If this was a page or two that became available due to memory freeing, > > > it's not a significant difference. Instead, if this races with an oom > > > notification and a process exiting or being SIGKILL'd, it becomes much > > > harder to explain to a user why their process was oom killed when there > > > are tens of megabytes of memory available as shown by the kernel log (the > > > freeing/exiting happened during a particularly long iteration of processes > > > attached to the memcg, for example). > > > > > > That's what motivated a change to moving this to out_of_memory() directly, > > > we found that it prevented even more unnecessary oom kills, which is a > > > very good thing. It may only be easily observable and make a significant > > > difference at very large scale, however. > > > > > > > Thanks for the clarification. > > > > If it is the race which causes this issue and we want to reduce the > > race window, I don't know whether it is proper to check the memcg > > margin in out_of_memory() or do it before calling do_send_sig_info(). > > Because per my understanding, dump_header() always takes much more > > time than select_bad_process() especially if there're slow consoles. > > So the race might easily happen when doing dump_header() or dumping > > other information, but if we check the memcg margin after dumping this > > oom info, it would be strange to dump so much oom logs without killing > > a process. > > > > Absolutely correct :) In my proposed patch, we declare dump_header() as > the "point of no return" since we don't want to dump oom kill information > to the kernel log when nothing is actually killed. We could abort at the > very last minute, as you mention, but I think that may have an adverse > impact on anything that cares about that log message. How about storing the memcg information in oom_control when the memcg oom is triggered, and then show this information in dump_header() ? IOW, the OOM info really shows the memcg status when oom occurs, rather than the memcg status when this info is printed.
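[Editor's note: a minimal sketch of what the snapshot idea suggested above might look like. The added oom_control fields and the helper name are hypothetical and do not exist upstream; the counters read are the same ones mem_cgroup_print_oom_meminfo() reports, so dump_header() could print the values captured when the OOM was declared instead of re-reading live state.]

	/* Hypothetical fields added to struct oom_control (illustration only). */
	struct oom_control {
		/* ... existing fields ... */
		unsigned long memcg_usage_at_oom;	/* pages charged when OOM hit */
		unsigned long memcg_limit_at_oom;	/* memory.max at that moment */
	};

	/* Hypothetical helper, called where the memcg OOM is first declared. */
	static void mem_cgroup_snapshot_oom(struct oom_control *oc)
	{
		if (!is_memcg_oom(oc))
			return;
		oc->memcg_usage_at_oom = page_counter_read(&oc->memcg->memory);
		oc->memcg_limit_at_oom = READ_ONCE(oc->memcg->memory.max);
	}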
On Wed 15-07-20 11:10:07, Yafang Shao wrote: [...] > If it is the race which causes this issue and we want to reduce the > race window, I don't know whether it is proper to check the memcg > margin in out_of_memory() or do it before calling do_send_sig_info(). > Because per my understanding, dump_header() always takes much more > time than select_bad_process() especially if there're slow consoles. > So the race might easily happen when doing dump_header() or dumping > other information, but if we check the memcg margin after dumping this > oom info, it would be strange to dump so much oom logs without killing > a process. Yes, this is my experience as well. Unless there are gazillions of tasks the oom victim selection should be reasonably swift. It is usually the oom report which takes the most time and this is a huge race window. So I think it would be best to go with your patch for now. It is more in line with the global oom flow and it is much easier to reason about.
On Wed, 15 Jul 2020, Yafang Shao wrote: > > > If it is the race which causes this issue and we want to reduce the > > > race window, I don't know whether it is proper to check the memcg > > > margin in out_of_memory() or do it before calling do_send_sig_info(). > > > Because per my understanding, dump_header() always takes much more > > > time than select_bad_process() especially if there're slow consoles. > > > So the race might easily happen when doing dump_header() or dumping > > > other information, but if we check the memcg margin after dumping this > > > oom info, it would be strange to dump so much oom logs without killing > > > a process. > > > > > > > Absolutely correct :) In my proposed patch, we declare dump_header() as > > the "point of no return" since we don't want to dump oom kill information > > to the kernel log when nothing is actually killed. We could abort at the > > very last minute, as you mention, but I think that may have an adverse > > impact on anything that cares about that log message. > > How about storing the memcg information in oom_control when the memcg > oom is triggered, and then show this information in dump_header() ? > IOW, the OOM info really shows the memcg status when oom occurs, > rather than the memcg status when this info is printed. > We actually do that too in our kernel but for slightly other reasons :) It's pretty interesting how a lot of our previous concerns with memcg oom killing have been echoed by you in this thread. But yes, we store vital information about the memcg at the time of the first oom event when the oom killer is disabled (to allow userspace to determine what the best course of action is). But regardless of whether we present previous data to the user in the kernel log or not, we've determined that oom killing a process is a serious matter and go to any lengths possible to avoid having to do it. For us, that means waiting until the "point of no return" to either go ahead with oom killing a process or aborting and retrying the charge. I don't think moving the mem_cgroup_margin() check to out_of_memory() right before printing the oom info and killing the process is a very invasive patch. Any strong preference against doing it that way? I think moving the check as late as possible to save a process from being killed when racing with an exiter or killed process (including perhaps current) has a pretty clear motivation.
On Thu, Jul 16, 2020 at 1:30 AM David Rientjes <rientjes@google.com> wrote: > > On Wed, 15 Jul 2020, Yafang Shao wrote: > > > > > If it is the race which causes this issue and we want to reduce the > > > > race window, I don't know whether it is proper to check the memcg > > > > margin in out_of_memory() or do it before calling do_send_sig_info(). > > > > Because per my understanding, dump_header() always takes much more > > > > time than select_bad_process() especially if there're slow consoles. > > > > So the race might easily happen when doing dump_header() or dumping > > > > other information, but if we check the memcg margin after dumping this > > > > oom info, it would be strange to dump so much oom logs without killing > > > > a process. > > > > > > > > > > Absolutely correct :) In my proposed patch, we declare dump_header() as > > > the "point of no return" since we don't want to dump oom kill information > > > to the kernel log when nothing is actually killed. We could abort at the > > > very last minute, as you mention, but I think that may have an adverse > > > impact on anything that cares about that log message. > > > > How about storing the memcg information in oom_control when the memcg > > oom is triggered, and then show this information in dump_header() ? > > IOW, the OOM info really shows the memcg status when oom occurs, > > rather than the memcg status when this info is printed. > > > > We actually do that too in our kernel but for slightly other reasons :) > It's pretty interesting how a lot of our previous concerns with memcg oom > killing have been echoed by you in this thread. These should be common concerns of container users :) I'm a heavy container user for now. > But yes, we store vital > information about the memcg at the time of the first oom event when the > oom killer is disabled (to allow userspace to determine what the best > course of action is). > It would be better if you could upstream the features in your kernel, and I think it could also help the other users. > But regardless of whether we present previous data to the user in the > kernel log or not, we've determined that oom killing a process is a > serious matter and go to any lengths possible to avoid having to do it. > For us, that means waiting until the "point of no return" to either go > ahead with oom killing a process or aborting and retrying the charge. > > I don't think moving the mem_cgroup_margin() check to out_of_memory() > right before printing the oom info and killing the process is a very > invasive patch. Any strong preference against doing it that way? I think > moving the check as late as possible to save a process from being killed > when racing with an exiter or killed process (including perhaps current) > has a pretty clear motivation. I understand what you mean "point of no return", but that seems a workaround rather than a fix. If you don't want to kill unnecessary processes, then checking the memcg margin before sending sigkill is better, because as I said before the race will be most likely happening in dump_header(). If you don't want to show strange OOM information like "your process was oom killed and it shows usage is 60MB in a memcg limited to 100MB", it is better to get the snapshot of the OOM when it is triggered and then show it later, and I think it could also apply to the global OOM. While my patch means to fix the issue caused by parallel OOM, because the others are waiting oom_lock while one process is doing OOM. 
And as explained by Michal before, it is more in line with the global oom flow and it is much easier to reason about.
On 2020/07/16 2:30, David Rientjes wrote: > But regardless of whether we present previous data to the user in the > kernel log or not, we've determined that oom killing a process is a > serious matter and go to any lengths possible to avoid having to do it. > For us, that means waiting until the "point of no return" to either go > ahead with oom killing a process or aborting and retrying the charge. > > I don't think moving the mem_cgroup_margin() check to out_of_memory() > right before printing the oom info and killing the process is a very > invasive patch. Any strong preference against doing it that way? I think > moving the check as late as possible to save a process from being killed > when racing with an exiter or killed process (including perhaps current) > has a pretty clear motivation. > How about ignoring MMF_OOM_SKIP for once? I think this has almost same effect as moving the mem_cgroup_margin() check to out_of_memory() right before printing the oom info and killing the process. diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 48e0db54d838..88170af3b9eb 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -322,7 +322,8 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) * any memory is quite low. */ if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) { - if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) + if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags) && + !test_and_clear_bit(MMF_OOM_REAP_QUEUED, &task->signal->oom_mm->flags)) goto next; goto abort; } @@ -658,7 +659,8 @@ static int oom_reaper(void *unused) static void wake_oom_reaper(struct task_struct *tsk) { /* mm is already queued? */ - if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) + if (test_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags) || + test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags)) return; get_task_struct(tsk);
On Wed 15-07-20 10:30:51, David Rientjes wrote: [...] > I don't think moving the mem_cgroup_margin() check to out_of_memory() > right before printing the oom info and killing the process is a very > invasive patch. Any strong preference against doing it that way? I think > moving the check as late as possible to save a process from being killed > when racing with an exiter or killed process (including perhaps current) > has a pretty clear motivation. We have been through this discussion several times in the past IIRC The conclusion has been that the allocator (charging path for the memcg) is the one to define OOM situation. This is an inherently racy situation as long as we are not synchronizing oom with the world, which I believe we agree, we do not want to do. There are few exceptions to bail out early from the oom under certain situations and the trend was to remove some of the existing ones rather than adding new because they had subtle side effects and were prone to lockups. As much as it might sound attractive to move mem_cgroup_margin resp. last allocation attempt closer to the actual oom killing I haven't seen any convincing data that would support that such a change would make a big difference. select_bad_process is not a free operation as it scales with the number of tasks in the oom domain but it shouldn't be a super expensive. The oom reporting is by far the most expensive part of the operation. That being said, really convincing data should be presented in order to do such a change. I do not think we want to do that just in case.
On Thu 16-07-20 14:54:01, Tetsuo Handa wrote: > On 2020/07/16 2:30, David Rientjes wrote: > > But regardless of whether we present previous data to the user in the > > kernel log or not, we've determined that oom killing a process is a > > serious matter and go to any lengths possible to avoid having to do it. > > For us, that means waiting until the "point of no return" to either go > > ahead with oom killing a process or aborting and retrying the charge. > > > > I don't think moving the mem_cgroup_margin() check to out_of_memory() > > right before printing the oom info and killing the process is a very > > invasive patch. Any strong preference against doing it that way? I think > > moving the check as late as possible to save a process from being killed > > when racing with an exiter or killed process (including perhaps current) > > has a pretty clear motivation. > > > > How about ignoring MMF_OOM_SKIP for once? I think this has almost same > effect as moving the mem_cgroup_margin() check to out_of_memory() > right before printing the oom info and killing the process. How would that help with races when a task is exiting while the oom selects a victim? We are not talking about races with the oom_reaper IIUC. Btw. if races with the oom_reaper are a concern then I would much rather delay the wake up than complicate the existing protocol even further.
On Thu, 16 Jul 2020, Michal Hocko wrote: > > I don't think moving the mem_cgroup_margin() check to out_of_memory() > > right before printing the oom info and killing the process is a very > > invasive patch. Any strong preference against doing it that way? I think > > moving the check as late as possible to save a process from being killed > > when racing with an exiter or killed process (including perhaps current) > > has a pretty clear motivation. > > We have been through this discussion several times in the past IIRC > The conclusion has been that the allocator (charging path for > the memcg) is the one to define OOM situation. This is an inherently > racy situation as long as we are not synchronizing oom with the world, > which I believe we agree, we do not want to do. There are few exceptions > to bail out early from the oom under certain situations and the trend > was to remove some of the existing ones rather than adding new because > they had subtle side effects and were prone to lockups. > > As much as it might sound attractive to move mem_cgroup_margin resp. > last allocation attempt closer to the actual oom killing I haven't seen > any convincing data that would support that such a change would make a > big difference. select_bad_process is not a free operation as it scales > with the number of tasks in the oom domain but it shouldn't be a super > expensive. The oom reporting is by far the most expensive part of the > operation. > > That being said, really convincing data should be presented in order > to do such a change. I do not think we want to do that just in case. It's not possible to present data because we've had such a check for years in our fleet so I can't say that it has prevented X unnecessary oom kills compared to doing the check prior to calling out_of_memory(). I'm hoping that can be understood. Since Yafang is facing the same issue, and there is no significant downside to doing the mem_cgroup_margin() check prior to oom_kill_process() (or checking task_will_free_mem(current)), and it's acknowledged that it *can* prevent unnecessary oom killing, which is a very good thing, I'd like to understand why such resistance to it. Killing a user process is a serious matter. I would fully agree if the margin is only one page: it's still better to kill something off. But when a process has uncharged memory by means induced by a process waiting on oom notication, such as a userspace kill or dropping of caches from your malloc implementation, that uncharge can be quite substantial and oom killing is then unnecessary. I can refresh the patch and send it formally.
On Thu, 16 Jul 2020, Yafang Shao wrote: > > But yes, we store vital > > information about the memcg at the time of the first oom event when the > > oom killer is disabled (to allow userspace to determine what the best > > course of action is). > > > > It would be better if you could upstream the features in your kernel, > and I think it could also help the other users. > Everything we've discussed so far has been proposed in the past, actually. I think we stress the oom killer and use it at scale that others do not, so only a subset of users would find it interesting. You are very likely one of those subset of users. We should certainly talk about other issues that we have run into that make the upstream oom killer unusable. Are there other areas that you're currently focused on or having trouble with? I'd be happy to have a discussion on how we have resolved a lot of its issues. > I understand what you mean "point of no return", but that seems a > workaround rather than a fix. > If you don't want to kill unnecessary processes, then checking the > memcg margin before sending sigkill is better, because as I said > before the race will be most likely happening in dump_header(). > If you don't want to show strange OOM information like "your process > was oom killed and it shows usage is 60MB in a memcg limited > to 100MB", it is better to get the snapshot of the OOM when it is > triggered and then show it later, and I think it could also apply to > the global OOM. > It's likely a misunderstanding: I wasn't necessarily concerned about showing 60MB in a memcg limited to 100MB, that part we can deal with, the concern was after dumping all of that great information that instead of getting a "Killed process..." we get a "Oh, there's memory now, just kidding about everything we just dumped" ;) We could likely enlighten userspace about that so that we don't consider that to be an actual oom kill. But I also very much agree that after dump_header() would be appropriate as well since the goal is to prevent unnecessary oom killing. Would you mind sending a patch to check mem_cgroup_margin() on is_memcg_oom() prior to sending the SIGKILL to the victim and printing the "Killed process..." line? We'd need a line that says "xKB of memory now available -- suppressing oom kill" or something along those lines so userspace understands what happened. But the memory info that it emits both for the state of the memcg and system RAM may also be helpful to understand why we got to the oom kill in the first place, which is also a good thing. I'd happy ack that patch since it would be a comprehensive solution that avoids oom kill of user processes at all costs, which is a goal I think we can all rally behind.
On Thu, 16 Jul 2020, Michal Hocko wrote: > > > But regardless of whether we present previous data to the user in the > > > kernel log or not, we've determined that oom killing a process is a > > > serious matter and go to any lengths possible to avoid having to do it. > > > For us, that means waiting until the "point of no return" to either go > > > ahead with oom killing a process or aborting and retrying the charge. > > > > > > I don't think moving the mem_cgroup_margin() check to out_of_memory() > > > right before printing the oom info and killing the process is a very > > > invasive patch. Any strong preference against doing it that way? I think > > > moving the check as late as possible to save a process from being killed > > > when racing with an exiter or killed process (including perhaps current) > > > has a pretty clear motivation. > > > > > > > How about ignoring MMF_OOM_SKIP for once? I think this has almost same > > effect as moving the mem_cgroup_margin() check to out_of_memory() > > right before printing the oom info and killing the process. > > How would that help with races when a task is exiting while the oom > selects a victim? We are not talking about races with the oom_reaper > IIUC. Btw. if races with the oom_reaper are a concern then I would much > rather delay the wake up than complicate the existing protocol even > further. Right, this isn't a concern about racing with oom reaping or when finding processes that have already been selected as the oom victim. This is about (potentially significant) amounts of memory that has been uncharged to the memcg hierarchy between the failure of reclaim to uncharge memory and the actual killing of a user process.
On Wed 15-07-20 23:56:11, David Rientjes wrote: > On Thu, 16 Jul 2020, Michal Hocko wrote: > > > > I don't think moving the mem_cgroup_margin() check to out_of_memory() > > > right before printing the oom info and killing the process is a very > > > invasive patch. Any strong preference against doing it that way? I think > > > moving the check as late as possible to save a process from being killed > > > when racing with an exiter or killed process (including perhaps current) > > > has a pretty clear motivation. > > > > We have been through this discussion several times in the past IIRC > > The conclusion has been that the allocator (charging path for > > the memcg) is the one to define OOM situation. This is an inherently > > racy situation as long as we are not synchronizing oom with the world, > > which I believe we agree, we do not want to do. There are few exceptions > > to bail out early from the oom under certain situations and the trend > > was to remove some of the existing ones rather than adding new because > > they had subtle side effects and were prone to lockups. > > > > As much as it might sound attractive to move mem_cgroup_margin resp. > > last allocation attempt closer to the actual oom killing I haven't seen > > any convincing data that would support that such a change would make a > > big difference. select_bad_process is not a free operation as it scales > > with the number of tasks in the oom domain but it shouldn't be a super > > expensive. The oom reporting is by far the most expensive part of the > > operation. > > > > That being said, really convincing data should be presented in order > > to do such a change. I do not think we want to do that just in case. > > It's not possible to present data because we've had such a check for years > in our fleet so I can't say that it has prevented X unnecessary oom kills > compared to doing the check prior to calling out_of_memory(). I'm hoping > that can be understood. > > Since Yafang is facing the same issue, and there is no significant > downside to doing the mem_cgroup_margin() check prior to > oom_kill_process() (or checking task_will_free_mem(current)), and it's > acknowledged that it *can* prevent unnecessary oom killing, which is a > very good thing, I'd like to understand why such resistance to it. Because exactly this kind of arguments has led to quite some "should be fine" heuristics which kicked back: do not kill exiting task, sacrifice child instead of a victim just to name few. All of them make some sense from a glance but they can serious kick back as the experience has thought us. Really, I do not see what is so hard to understand that each heuristic, especially those to subtle areas like oom definitely is, needs data to justify them. We are running this for years is really not an argument. Sure arguing that your workload leads to x amount of false positives and just shifting the check to later saves y amount of them sounds like a relevant argument to me. > Killing a user process is a serious matter. It definitely is and I believe that nobody is questioning that. The oom situation is a serious matter on its own. It says that the system has failed to balance the memory consumption and the oom killer is the _last_ resort action to be taken. > I would fully agree if the > margin is only one page: it's still better to kill something off. And exactly these kinds of details make any heuristic subtle and hard maintain. 
> But > when a process has uncharged memory by means induced by a process waiting > on oom notication, such as a userspace kill or dropping of caches from > your malloc implementation, that uncharge can be quite substantial and oom > killing is then unnecessary. That should happen quite some time before the hard limit is reached and we have means to achieve that. High limit is there to help with pro-active reclaim before the OOM happens on the hard limit. If you are stuck with the v1 then oom disable and shifting the whole logic into the userspace is another variant. > I can refresh the patch and send it formally.
On Thu, Jul 16, 2020 at 3:04 PM David Rientjes <rientjes@google.com> wrote: > > On Thu, 16 Jul 2020, Yafang Shao wrote: > > > > But yes, we store vital > > > information about the memcg at the time of the first oom event when the > > > oom killer is disabled (to allow userspace to determine what the best > > > course of action is). > > > > > > > It would be better if you could upstream the features in your kernel, > > and I think it could also help the other users. > > > > Everything we've discussed so far has been proposed in the past, actually. > I think we stress the oom killer and use it at scale that others do not, > so only a subset of users would find it interesting. You are very likely > one of those subset of users. > > We should certainly talk about other issues that we have run into that > make the upstream oom killer unusable. Are there other areas that you're > currently focused on or having trouble with? I'd be happy to have a > discussion on how we have resolved a lot of its issues. > > > I understand what you mean "point of no return", but that seems a > > workaround rather than a fix. > > If you don't want to kill unnecessary processes, then checking the > > memcg margin before sending sigkill is better, because as I said > > before the race will be most likely happening in dump_header(). > > If you don't want to show strange OOM information like "your process > > was oom killed and it shows usage is 60MB in a memcg limited > > to 100MB", it is better to get the snapshot of the OOM when it is > > triggered and then show it later, and I think it could also apply to > > the global OOM. > > > > It's likely a misunderstanding: I wasn't necessarily concerned about > showing 60MB in a memcg limited to 100MB, that part we can deal with, the > concern was after dumping all of that great information that instead of > getting a "Killed process..." we get a "Oh, there's memory now, just > kidding about everything we just dumped" ;) > Actually the kernel is doing it now, see bellow, dump_header() <<<< dump lots of information __oom_kill_process p = find_lock_task_mm(victim); if (!p) return; <<<< without killing any process. > We could likely enlighten userspace about that so that we don't consider > that to be an actual oom kill. But I also very much agree that after > dump_header() would be appropriate as well since the goal is to prevent > unnecessary oom killing. > > Would you mind sending a patch to check mem_cgroup_margin() on > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the > "Killed process..." line? We'd need a line that says "xKB of memory now > available -- suppressing oom kill" or something along those lines so > userspace understands what happened. But the memory info that it emits > both for the state of the memcg and system RAM may also be helpful to > understand why we got to the oom kill in the first place, which is also a > good thing. > > I'd happy ack that patch since it would be a comprehensive solution that > avoids oom kill of user processes at all costs, which is a goal I think we > can all rally behind. I'd prefer to put dump_header() behind do_send_sig_info(), for example, __oom_kill_process() do_send_sig_info() dump_header() <<<< may better put it behind wake_oom_reaper(), but it may loses some information to dump... pr_err("%s: Killed process %d (%s)....") Because the main goal of OOM is to kill a process to free pages ASAP to avoid system stall or memcg stall. 
We all find that dump_header() may take a long time to finish especially if there is a slow console, and this long time may cause a great system stall, so we'd better defer the dump of it. But that should be another topic.
On Thu 16-07-20 19:53:12, Yafang Shao wrote: [...] > I'd prefer to put dump_header() behind do_send_sig_info(), for example, > > __oom_kill_process() > do_send_sig_info() > dump_header() <<<< may better put it behind wake_oom_reaper(), but > it may loses some information to dump... > pr_err("%s: Killed process %d (%s)....") > > Because the main goal of OOM is to kill a process to free pages ASAP > to avoid system stall or memcg stall. > We all find that dump_header() may take a long time to finish > especially if there is a slow console, and this long time may cause a > great system stall, so we'd better defer the dump of it. I would be worried that such a memory dump would be very likely not reflecting the actual OOM situation and thus hard to use for analysis. Why? Because the killed oom victim can _usually_ die or the oom reaper can free up a lot of memory very quickly. If we look at it the main part of the report which takes long to dump is dump_tasks because this is potentially a lot of data. I haven't seen show_mem or its memcg counterpart to take a long time. We've discussed how much dump_tasks is really useful for production systems and the general feedback was that it should be enabled by default, though. If the oom report should be misleading then it would be better to not bother at all IMHO. From my experience the memory state is really useful when debugging ooms and having a reasonable snapshot is really important. IIRC Tetsuo was suggesting collecting all the information at the time of the oom and print that snapshot later on but the patch was quite invasive and didn't handle multiple OOMs from different oom domains - especially the dump_task part would need to link all the tasks for different oom contexts. Anyway, I think we are getting off topic here.
On 2020/07/16 21:21, Michal Hocko wrote: > If the oom report should be misleading then it would be better to not > bother at all IMHO. From my experience the memory state is really useful > when debugging ooms and having a reasonable snapshot is really > important. IIRC Tetsuo was suggesting collecting all the information at > the time of the oom and print that snapshot later on but the patch was > quite invasive and didn't handle multiple OOMs from different oom > domains - especially the dump_task part would need to link all the tasks > for different oom contexts. I did, and you just refused as always. https://lore.kernel.org/linux-mm/20190830103504.GA28313@dhcp22.suse.cz/ https://lore.kernel.org/linux-mm/20190909113627.GJ27159@dhcp22.suse.cz/
On Thu, 16 Jul 2020, Yafang Shao wrote:

> > It's likely a misunderstanding: I wasn't necessarily concerned about
> > showing 60MB in a memcg limited to 100MB, that part we can deal with, the
> > concern was after dumping all of that great information that instead of
> > getting a "Killed process..." we get a "Oh, there's memory now, just
> > kidding about everything we just dumped" ;)
> >
>
> Actually the kernel is doing it now, see below:
>
>     dump_header()        <<<< dump lots of information
>     __oom_kill_process
>         p = find_lock_task_mm(victim);
>         if (!p)
>             return;      <<<< without killing any process.
>

Ah, this is catching an instance where the chosen process has already done
exit_mm(), good catch -- I can find examples of this by scraping kernel
logs from our fleet.

So it appears there is precedent for dumping all the oom info but not
actually performing any action for it, and I made the earlier point that
diagnostic information in the kernel log here is still useful. I think it
is still preferable that the kernel at least tell us why it didn't do
anything, but as you mention that already happens today.

Would you like to send a patch that checks for mem_cgroup_margin() here as
well? A second patch could make the possible inaction more visible,
something like "Process ${pid} (${comm}) is already exiting" for the above
check or "Memcg ${memcg} is no longer out of memory".

Another thing that these messages indicate, beyond telling us why the oom
killer didn't actually SIGKILL anything, is that we can expect some skew
in the memory stats that shows an availability of memory.

> > We could likely enlighten userspace about that so that we don't consider
> > that to be an actual oom kill. But I also very much agree that after
> > dump_header() would be appropriate as well since the goal is to prevent
> > unnecessary oom killing.
> >
> > Would you mind sending a patch to check mem_cgroup_margin() on
> > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> > "Killed process..." line? We'd need a line that says "xKB of memory now
> > available -- suppressing oom kill" or something along those lines so
> > userspace understands what happened. But the memory info that it emits
> > both for the state of the memcg and system RAM may also be helpful to
> > understand why we got to the oom kill in the first place, which is also a
> > good thing.
> >
> > I'd happily ack that patch since it would be a comprehensive solution that
> > avoids oom kill of user processes at all costs, which is a goal I think we
> > can all rally behind.
>
> I'd prefer to put dump_header() behind do_send_sig_info(), for example:
>
>     __oom_kill_process()
>         do_send_sig_info()
>         dump_header()    <<<< may be better to put it behind wake_oom_reaper(),
>                               but it may lose some information to dump...
>         pr_err("%s: Killed process %d (%s)....")
>

I agree with Michal here that dump_header() after the actual kill would no
longer represent the state of the system (or cpuset or memcg, depending on
context) at the time of the oom kill, so it's best to dump relevant
information before the actual kill.
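To make the shape of the check being requested here concrete, a minimal
kernel-style sketch could look like the following. This is only an
illustration, not the patch under discussion: the helper name is made up,
it assumes mem_cgroup_margin() (today static to mm/memcontrol.c) were made
visible to the oom killer, and the printed message is just a placeholder
for whatever wording ends up being agreed on.

    /*
     * Illustrative sketch: bail out of the kill if the memcg regained
     * enough headroom while the victim was being selected and the oom
     * state was being dumped.
     */
    static bool memcg_oom_kill_still_needed(struct oom_control *oc)
    {
            unsigned long margin;

            if (!is_memcg_oom(oc))
                    return true;

            margin = mem_cgroup_margin(oc->memcg);
            if (margin < (1 << oc->order))
                    return true;

            /* placeholder wording, per the discussion above */
            pr_info("%lukB of memory now available -- suppressing oom kill\n",
                    margin << (PAGE_SHIFT - 10));
            return false;
    }

Such a helper would be consulted right before do_send_sig_info() and the
"Killed process..." line; how userspace should account for the suppressed
kill is the open question raised above.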
On Thu, 16 Jul 2020, Michal Hocko wrote:

> > It's not possible to present data because we've had such a check for years
> > in our fleet so I can't say that it has prevented X unnecessary oom kills
> > compared to doing the check prior to calling out_of_memory(). I'm hoping
> > that can be understood.
> >
> > Since Yafang is facing the same issue, and there is no significant
> > downside to doing the mem_cgroup_margin() check prior to
> > oom_kill_process() (or checking task_will_free_mem(current)), and it's
> > acknowledged that it *can* prevent unnecessary oom killing, which is a
> > very good thing, I'd like to understand why such resistance to it.
>
> Because exactly this kind of argument has led to quite a few "should be
> fine" heuristics which kicked back: do not kill an exiting task, sacrifice
> a child instead of the victim, just to name a few. All of them make some
> sense at a glance but they can seriously kick back, as experience has
> taught us.
>
> Really, I do not see what is so hard to understand: each heuristic,
> especially in a subtle area like oom, needs data to justify it. "We have
> been running this for years" is really not an argument. Sure, arguing
> that your workload leads to x amount of false positives and just shifting
> the check to later saves y amount of them sounds like a relevant argument
> to me.
>

Deferring the go/no-go decision on the oom kill to the very last moment
doesn't seem like a heuristic; I think it's an inherent responsibility of
the kernel to do whatever is necessary to prevent a userspace process from
being oom killed (and the way to solve Yafang's issue that we had solved
years ago). That can be done by closing the window as much as possible
(including within out_of_memory()) to reduce the likelihood of unnecessary
oom killing. It's intuitive and seems rather trivial. I would understand
an argument against such an approach if it added elaborate complexity, but
this isn't doing so.

If the decision was already made in oom_kill_process(), I don't think
anybody would advocate for moving it out to its current place before
out_of_memory(). We aren't losing anything here, we are only preventing
unnecessary oom killing that has caused issues for Yafang as well as us.

Any solution that does a mem_cgroup_margin() check before out_of_memory()
in the memcg path is closing that window a little bit, but I think we can
do better.
On Fri, Jul 17, 2020 at 3:53 AM David Rientjes <rientjes@google.com> wrote:
>
> On Thu, 16 Jul 2020, Yafang Shao wrote:
>
> > > It's likely a misunderstanding: I wasn't necessarily concerned about
> > > showing 60MB in a memcg limited to 100MB, that part we can deal with, the
> > > concern was after dumping all of that great information that instead of
> > > getting a "Killed process..." we get a "Oh, there's memory now, just
> > > kidding about everything we just dumped" ;)
> > >
> >
> > Actually the kernel is doing it now, see below:
> >
> >     dump_header()        <<<< dump lots of information
> >     __oom_kill_process
> >         p = find_lock_task_mm(victim);
> >         if (!p)
> >             return;      <<<< without killing any process.
> >
>
> Ah, this is catching an instance where the chosen process has already done
> exit_mm(), good catch -- I can find examples of this by scraping kernel
> logs from our fleet.
>
> So it appears there is precedent for dumping all the oom info but not
> actually performing any action for it, and I made the earlier point that
> diagnostic information in the kernel log here is still useful. I think it
> is still preferable that the kernel at least tell us why it didn't do
> anything, but as you mention that already happens today.
>
> Would you like to send a patch that checks for mem_cgroup_margin() here as
> well? A second patch could make the possible inaction more visible,
> something like "Process ${pid} (${comm}) is already exiting" for the above
> check or "Memcg ${memcg} is no longer out of memory".
>
> Another thing that these messages indicate, beyond telling us why the oom
> killer didn't actually SIGKILL anything, is that we can expect some skew
> in the memory stats that shows an availability of memory.
>

Agreed, these messages would be helpful.
I will send a patch for it.

> >
> > > We could likely enlighten userspace about that so that we don't consider
> > > that to be an actual oom kill. But I also very much agree that after
> > > dump_header() would be appropriate as well since the goal is to prevent
> > > unnecessary oom killing.
> > >
> > > Would you mind sending a patch to check mem_cgroup_margin() on
> > > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> > > "Killed process..." line? We'd need a line that says "xKB of memory now
> > > available -- suppressing oom kill" or something along those lines so
> > > userspace understands what happened. But the memory info that it emits
> > > both for the state of the memcg and system RAM may also be helpful to
> > > understand why we got to the oom kill in the first place, which is also a
> > > good thing.
> > >
> > > I'd happily ack that patch since it would be a comprehensive solution that
> > > avoids oom kill of user processes at all costs, which is a goal I think we
> > > can all rally behind.
> >
> > I'd prefer to put dump_header() behind do_send_sig_info(), for example:
> >
> >     __oom_kill_process()
> >         do_send_sig_info()
> >         dump_header()    <<<< may be better to put it behind wake_oom_reaper(),
> >                               but it may lose some information to dump...
> >         pr_err("%s: Killed process %d (%s)....")
> >
>
> I agree with Michal here that dump_header() after the actual kill would no
> longer represent the state of the system (or cpuset or memcg, depending on
> context) at the time of the oom kill, so it's best to dump relevant
> information before the actual kill.
On Fri, 17 Jul 2020, Yafang Shao wrote:

> > > Actually the kernel is doing it now, see below:
> > >
> > >     dump_header()        <<<< dump lots of information
> > >     __oom_kill_process
> > >         p = find_lock_task_mm(victim);
> > >         if (!p)
> > >             return;      <<<< without killing any process.
> > >
> >
> > Ah, this is catching an instance where the chosen process has already done
> > exit_mm(), good catch -- I can find examples of this by scraping kernel
> > logs from our fleet.
> >
> > So it appears there is precedent for dumping all the oom info but not
> > actually performing any action for it, and I made the earlier point that
> > diagnostic information in the kernel log here is still useful. I think it
> > is still preferable that the kernel at least tell us why it didn't do
> > anything, but as you mention that already happens today.
> >
> > Would you like to send a patch that checks for mem_cgroup_margin() here as
> > well? A second patch could make the possible inaction more visible,
> > something like "Process ${pid} (${comm}) is already exiting" for the above
> > check or "Memcg ${memcg} is no longer out of memory".
> >
> > Another thing that these messages indicate, beyond telling us why the oom
> > killer didn't actually SIGKILL anything, is that we can expect some skew
> > in the memory stats that shows an availability of memory.
> >
>
> Agreed, these messages would be helpful.
> I will send a patch for it.
>

Thanks Yafang. We should also continue talking about challenges you
encounter with the oom killer either at the system level or for memcg
limit ooms in a separate thread. It's clear that you are meeting several
of the issues that we have previously seen ourselves.

I could do a full audit of all our oom killer changes that may be
interesting to you, but off the top of my head:

 - A means of triggering a memcg oom through the kernel: think of sysrq+f
   but scoped to processes attached to a memcg hierarchy. This allows
   userspace to reliably oom kill processes on overcommitted systems
   (SIGKILL can be insufficient if we depend on oom reaping, for example,
   to make forward progress)

 - Storing the state of a memcg's memory at the time reclaim has failed
   and we must oom kill: when the memcg oom killer is disabled so that
   userspace can handle it, if it triggers an oom kill through the kernel
   because it prefers an oom kill on an overcommitted system, we need to
   dump the state of the memory at oom rather than with the stack of the
   explicit trigger

 - Supplement memcg oom notification with an additional notification event
   on kernel oom kill: allows users to register for an event that triggers
   when the kernel oom killer kills something (and keeps a count of these
   events available for read)

 - Add a notion of an oom delay: on overcommitted systems, userspace may
   become unreliable or unresponsive despite our best efforts, this
   supplements the ability to disable the oom killer for a memcg hierarchy
   with the ability to disable it for a set period of time until the oom
   killer intervenes and kills something (last ditch effort).

I'd be happy to discuss any of these topics if you are interested.
On Sat, Jul 18, 2020 at 3:26 AM David Rientjes <rientjes@google.com> wrote:
>
> On Fri, 17 Jul 2020, Yafang Shao wrote:
>
> > > > Actually the kernel is doing it now, see below:
> > > >
> > > >     dump_header()        <<<< dump lots of information
> > > >     __oom_kill_process
> > > >         p = find_lock_task_mm(victim);
> > > >         if (!p)
> > > >             return;      <<<< without killing any process.
> > > >
> > >
> > > Ah, this is catching an instance where the chosen process has already done
> > > exit_mm(), good catch -- I can find examples of this by scraping kernel
> > > logs from our fleet.
> > >
> > > So it appears there is precedent for dumping all the oom info but not
> > > actually performing any action for it, and I made the earlier point that
> > > diagnostic information in the kernel log here is still useful. I think it
> > > is still preferable that the kernel at least tell us why it didn't do
> > > anything, but as you mention that already happens today.
> > >
> > > Would you like to send a patch that checks for mem_cgroup_margin() here as
> > > well? A second patch could make the possible inaction more visible,
> > > something like "Process ${pid} (${comm}) is already exiting" for the above
> > > check or "Memcg ${memcg} is no longer out of memory".
> > >
> > > Another thing that these messages indicate, beyond telling us why the oom
> > > killer didn't actually SIGKILL anything, is that we can expect some skew
> > > in the memory stats that shows an availability of memory.
> > >
> >
> > Agreed, these messages would be helpful.
> > I will send a patch for it.
> >
>
> Thanks Yafang. We should also continue talking about challenges you
> encounter with the oom killer either at the system level or for memcg
> limit ooms in a separate thread. It's clear that you are meeting several
> of the issues that we have previously seen ourselves.
>
> I could do a full audit of all our oom killer changes that may be
> interesting to you, but off the top of my head:
>
> - A means of triggering a memcg oom through the kernel: think of sysrq+f
>   but scoped to processes attached to a memcg hierarchy. This allows
>   userspace to reliably oom kill processes on overcommitted systems
>   (SIGKILL can be insufficient if we depend on oom reaping, for example,
>   to make forward progress)
>

A memcg sysrq+f would be helpful. But I'm wondering, how about waking up
the oom_reaper when we send SIGKILL to a process?

For the three proposals below, I think they would be helpful as well and I
don't have any objections.

> - Storing the state of a memcg's memory at the time reclaim has failed
>   and we must oom kill: when the memcg oom killer is disabled so that
>   userspace can handle it, if it triggers an oom kill through the kernel
>   because it prefers an oom kill on an overcommitted system, we need to
>   dump the state of the memory at oom rather than with the stack of the
>   explicit trigger
>
> - Supplement memcg oom notification with an additional notification event
>   on kernel oom kill: allows users to register for an event that triggers
>   when the kernel oom killer kills something (and keeps a count of these
>   events available for read)
>
> - Add a notion of an oom delay: on overcommitted systems, userspace may
>   become unreliable or unresponsive despite our best efforts, this
>   supplements the ability to disable the oom killer for a memcg hierarchy
>   with the ability to disable it for a set period of time until the oom
>   killer intervenes and kills something (last ditch effort).
>
> I'd be happy to discuss any of these topics if you are interested.

Please send these patches at your convenience.
On Wed, Jul 15, 2020 at 11:56:11PM -0700, David Rientjes wrote:
> On Thu, 16 Jul 2020, Michal Hocko wrote:
>
> > > I don't think moving the mem_cgroup_margin() check to out_of_memory()
> > > right before printing the oom info and killing the process is a very
> > > invasive patch. Any strong preference against doing it that way? I think
> > > moving the check as late as possible to save a process from being killed
> > > when racing with an exiter or killed process (including perhaps current)
> > > has a pretty clear motivation.
> >
> > We have been through this discussion several times in the past IIRC.
> > The conclusion has been that the allocator (charging path for the memcg)
> > is the one to define the OOM situation. This is an inherently racy
> > situation as long as we are not synchronizing oom with the world, which
> > I believe we agree we do not want to do. There are a few exceptions to
> > bail out early from the oom under certain situations, and the trend has
> > been to remove some of the existing ones rather than add new ones
> > because they had subtle side effects and were prone to lockups.
> >
> > As much as it might sound attractive to move mem_cgroup_margin resp. the
> > last allocation attempt closer to the actual oom killing, I haven't seen
> > any convincing data supporting that such a change would make a big
> > difference. select_bad_process is not a free operation as it scales with
> > the number of tasks in the oom domain, but it shouldn't be super
> > expensive. The oom reporting is by far the most expensive part of the
> > operation.
> >
> > That being said, really convincing data should be presented in order
> > to do such a change. I do not think we want to do that just in case.
>
> It's not possible to present data because we've had such a check for years
> in our fleet so I can't say that it has prevented X unnecessary oom kills
> compared to doing the check prior to calling out_of_memory(). I'm hoping
> that can be understood.

That makes sense. I would just be curious whether you can remember what
data points you looked at at the time you made the change. Were you
inspecting individual OOM kills and seeing in the memory dump after the
fact that the kill would have been unnecessary? Or were you looking at
aggregate OOM kill rates in your fleet and seeing a measurable reduction?

> Since Yafang is facing the same issue, and there is no significant
> downside to doing the mem_cgroup_margin() check prior to
> oom_kill_process() (or checking task_will_free_mem(current)), and it's
> acknowledged that it *can* prevent unnecessary oom killing, which is a
> very good thing, I'd like to understand why such resistance to it.

As someone who has been fairly vocal against this in the past, I have to
admit I don't feel too strongly about it anymore.

My argument in the past has been that if you're going for a probabilistic
reduction of OOM kills, injecting sleeps in between reclaim and the last
allocation attempt could also increase the chance of coincidental parallel
memory frees. And that just always seemed somewhat arbitrary to me.

I was also worried we'd be opening a can of worms by allowing this type of
tweaking of the OOM killer, when OOM kills are supposed to be a
once-in-a-blue-moon deadlock breaker, rather than a common resource
enforcement strategy.

However, I also have to admit that while artificial sleeps would indeed be
pretty arbitrary and likely controversial, the time it takes to select a
victim is unavoidable. It's something we need to do in any case.

If checking the margin one last time after that helps you bring down the
kill rate somewhat, I'd say go for it. It's a natural line, and I would
agree that the change isn't very invasive.

> Killing a user process is a serious matter. I would fully agree if the
> margin is only one page: it's still better to kill something off. But
> when a process has uncharged memory by means induced by a process waiting
> on oom notification, such as a userspace kill or dropping of caches from
> your malloc implementation, that uncharge can be quite substantial and oom
> killing is then unnecessary.
>
> I can refresh the patch and send it formally.

No objection from me.
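For completeness, the later placement discussed in this thread -- re-checking
the margin in out_of_memory() once a victim has been selected, but before
dump_header() and the kill -- could look roughly like the sketch below. Again
this is only an illustration: it assumes the same visibility of
mem_cgroup_margin() from the oom killer and omits details a real patch would
have to handle, such as dropping the reference that victim selection takes on
the chosen task.

    select_bad_process(oc);
    /*
     * Illustrative sketch: last-minute bail-out after a victim has been
     * chosen but before any oom state is dumped or a SIGKILL is sent.
     */
    if (is_memcg_oom(oc) &&
        mem_cgroup_margin(oc->memcg) >= (1 << oc->order))
            return true;    /* the charge can now succeed, skip the kill */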
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1962232..15e0e18 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	if (mutex_lock_killable(&oom_lock))
 		return true;
+
+	if (mem_cgroup_margin(memcg) >= (1 << order))
+		goto unlock;
+
 	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
 	ret = should_force_charge() || out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }
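A note on the check itself: mem_cgroup_margin() returns roughly how many more
pages can be charged to the memcg before it hits its limit (also taking the
memory+swap limit into account when that accounting is enabled), while
1 << order is the number of pages the failed charge was asking for, so the new
bail-out only fires when the headroom that appeared while waiting for oom_lock
actually covers the allocation that brought us here. Annotated excerpt, with
comments added purely for illustration (they are not part of the patch):

    	if (mutex_lock_killable(&oom_lock))
    		return true;

    	/*
    	 * Another task may have uncharged or freed memory while we were
    	 * waiting for oom_lock; if the margin now covers the pages this
    	 * charge wanted, report success without invoking the oom killer.
    	 */
    	if (mem_cgroup_margin(memcg) >= (1 << order))
    		goto unlock;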
Memcg oom killer invocation is synchronized by the global oom_lock and
tasks are sleeping on the lock while somebody is selecting the victim or
potentially racing with the oom_reaper releasing the victim's memory.
This can result in a pointless oom killer invocation because a waiter
might be racing with the oom_reaper:

    P1                     oom_reaper                P2
                           oom_reap_task             mutex_lock(oom_lock)
                                                     out_of_memory
                                                       # no victim because we have one already
                           __oom_reap_task_mm        mutex_unlock(oom_lock)
    mutex_lock(oom_lock)
                           set MMF_OOM_SKIP
    select_bad_process
      # finds a new victim

The page allocator prevents this race by trying to allocate again after
the lock can be acquired (in __alloc_pages_may_oom), which acts as a last
minute check. Moreover, the page allocator simply doesn't block on the
oom_lock and instead retries the whole reclaim process.

Memcg oom killer should do the last minute check as well. Call
mem_cgroup_margin to do that. Trylock on the oom_lock could be done as
well but this doesn't seem to be necessary at this stage.

[mhocko@kernel.org: commit log]
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
v1 -> v2:
- commit log improved by Michal
- retitle the subject from "mm, oom: check memcg margin for parallel oom"
- code simplicity, per Michal
---
 mm/memcontrol.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)