[v2] memcg, oom: check memcg margin for parallel oom

Message ID 1594735034-19190-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Series [v2] memcg, oom: check memcg margin for parallel oom

Commit Message

Yafang Shao July 14, 2020, 1:57 p.m. UTC
Memcg oom killer invocation is synchronized by the global oom_lock: tasks
sleep on the lock while somebody is selecting the victim, and they can
race with the oom_reaper releasing the victim's memory. This can result
in a pointless oom killer invocation because a waiter might be racing
with the oom_reaper:

        P1              oom_reaper              P2
                        oom_reap_task           mutex_lock(oom_lock)
                                                out_of_memory # no victim because we have one already
                        __oom_reap_task_mm      mutex_unlock(oom_lock)
 mutex_lock(oom_lock)
                        set MMF_OOM_SKIP
 select_bad_process
 # finds a new victim

The page allocator prevents this race by retrying the allocation once the
lock has been acquired (in __alloc_pages_may_oom), which acts as a last
minute check. Moreover, the page allocator doesn't block on the oom_lock
at all and simply retries the whole reclaim process.

The memcg oom killer should do the same last minute check. Call
mem_cgroup_margin to do that. A trylock on the oom_lock could be done as
well but this doesn't seem to be necessary at this stage.

[mhocko@kernel.org: commit log]
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

---
v1 -> v2:
- commit log improved by Michal
- retitle the subject from "mm, oom: check memcg margin for parallel oom"
- code simplicity, per Michal
---
 mm/memcontrol.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
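
For reference, this is roughly how mem_cgroup_out_of_memory() reads with the
patch applied (reconstructed from the diff quoted in the replies below; the
oom_control initialization is taken from the mainline code of that era):

static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
				     int order)
{
	struct oom_control oc = {
		.zonelist = NULL,
		.nodemask = NULL,
		.memcg = memcg,
		.gfp_mask = gfp_mask,
		.order = order,
	};
	bool ret = true;

	if (mutex_lock_killable(&oom_lock))
		return true;

	/* Last minute check: a parallel oom kill may have released charges. */
	if (mem_cgroup_margin(memcg) >= (1 << order))
		goto unlock;

	/*
	 * A few threads which were not waiting at mutex_lock_killable() can
	 * fail to bail out. Therefore, check again after holding oom_lock.
	 */
	ret = should_force_charge() || out_of_memory(&oc);

unlock:
	mutex_unlock(&oom_lock);
	return ret;
}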

Comments

Michal Hocko July 14, 2020, 2:05 p.m. UTC | #1
On Tue 14-07-20 09:57:14, Yafang Shao wrote:
> Memcg oom killer invocation is synchronized by the global oom_lock and
> tasks are sleeping on the lock while somebody is selecting the victim or
> potentially race with the oom_reaper is releasing the victim's memory.
> This can result in a pointless oom killer invocation because a waiter
> might be racing with the oom_reaper
> 
>         P1              oom_reaper              P2
>                         oom_reap_task           mutex_lock(oom_lock)
>                                                 out_of_memory # no victim because we have one already
>                         __oom_reap_task_mm      mute_unlock(oom_lock)
>  mutex_lock(oom_lock)
>                         set MMF_OOM_SKIP
>  select_bad_process
>  # finds a new victim
> 
> The page allocator prevents from this race by trying to allocate after
> the lock can be acquired (in __alloc_pages_may_oom) which acts as a last
> minute check. Moreover page allocator simply doesn't block on the
> oom_lock and simply retries the whole reclaim process.
> 
> Memcg oom killer should do the last minute check as well. Call
> mem_cgroup_margin to do that. Trylock on the oom_lock could be done as
> well but this doesn't seem to be necessary at this stage.
> 
> [mhocko@kernel.org: commit log]
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
> v1 -> v2:
> - commit log improved by Michal
> - retitle the subject from "mm, oom: check memcg margin for parallel oom"
> - code simplicity, per Michal
> ---
>  mm/memcontrol.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1962232..15e0e18 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	bool ret;
> +	bool ret = true;
>  
>  	if (mutex_lock_killable(&oom_lock))
>  		return true;
> +
> +	if (mem_cgroup_margin(memcg) >= (1 << order))
> +		goto unlock;
> +
>  	/*
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
> +
> +unlock:
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }
> -- 
> 1.8.3.1
Chris Down July 14, 2020, 2:30 p.m. UTC | #2
Yafang Shao writes:
>Memcg oom killer invocation is synchronized by the global oom_lock and
>tasks are sleeping on the lock while somebody is selecting the victim or
>potentially race with the oom_reaper is releasing the victim's memory.
>This can result in a pointless oom killer invocation because a waiter
>might be racing with the oom_reaper
>
>        P1              oom_reaper              P2
>                        oom_reap_task           mutex_lock(oom_lock)
>                                                out_of_memory # no victim because we have one already
>                        __oom_reap_task_mm      mute_unlock(oom_lock)
> mutex_lock(oom_lock)
>                        set MMF_OOM_SKIP
> select_bad_process
> # finds a new victim
>
>The page allocator prevents from this race by trying to allocate after
>the lock can be acquired (in __alloc_pages_may_oom) which acts as a last
>minute check. Moreover page allocator simply doesn't block on the
>oom_lock and simply retries the whole reclaim process.
>
>Memcg oom killer should do the last minute check as well. Call
>mem_cgroup_margin to do that. Trylock on the oom_lock could be done as
>well but this doesn't seem to be necessary at this stage.
>
>[mhocko@kernel.org: commit log]
>Suggested-by: Michal Hocko <mhocko@kernel.org>
>Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>Cc: Michal Hocko <mhocko@kernel.org>
>Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
>Cc: David Rientjes <rientjes@google.com>
>Cc: Johannes Weiner <hannes@cmpxchg.org>

Good catch, thanks.

Acked-by: Chris Down <chris@chrisdown.name>
David Rientjes July 14, 2020, 6:46 p.m. UTC | #3
On Tue, 14 Jul 2020, Yafang Shao wrote:

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1962232..15e0e18 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	bool ret;
> +	bool ret = true;
>  
>  	if (mutex_lock_killable(&oom_lock))
>  		return true;
> +
> +	if (mem_cgroup_margin(memcg) >= (1 << order))
> +		goto unlock;
> +
>  	/*
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
> +
> +unlock:
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }

Hi Yafang,

We've run with a patch very much like this for several years and it works 
quite successfully to prevent the unnecessary oom killing of processes.

We do this in out_of_memory() directly, however, because we found that we 
could prevent even *more* unnecessary killing by checking at the 
"point of no return": the selection of processes takes some additional 
time, during which the oom condition may resolve itself.

Some may argue that this is unnecessarily exposing mem_cgroup_margin() to 
generic mm code, but in the interest of preventing any unnecessary oom 
kill we've found it to be helpful.

I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
+
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -825,6 +827,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 {
 }
 
+static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
 						  bool in_low_reclaim)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
 	unsigned long margin = 0;
 	unsigned long count;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL)
+	if (oc->chosen && oc->chosen != (void *)-1UL) {
+		if (is_memcg_oom(oc)) {
+			/*
+			 * If a memcg is now under its limit or current will be
+			 * exiting and freeing memory, avoid needlessly killing
+			 * chosen.
+			 */
+			if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
+			    task_will_free_mem(current)) {
+				put_task_struct(oc->chosen);
+				return true;
+			}
+		}
+
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
+	}
 	return !!oc->chosen;
 }
Yafang Shao July 15, 2020, 1:44 a.m. UTC | #4
On Wed, Jul 15, 2020 at 2:46 AM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 14 Jul 2020, Yafang Shao wrote:
>
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1962232..15e0e18 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >               .gfp_mask = gfp_mask,
> >               .order = order,
> >       };
> > -     bool ret;
> > +     bool ret = true;
> >
> >       if (mutex_lock_killable(&oom_lock))
> >               return true;
> > +
> > +     if (mem_cgroup_margin(memcg) >= (1 << order))
> > +             goto unlock;
> > +
> >       /*
> >        * A few threads which were not waiting at mutex_lock_killable() can
> >        * fail to bail out. Therefore, check again after holding oom_lock.
> >        */
> >       ret = should_force_charge() || out_of_memory(&oc);
> > +
> > +unlock:
> >       mutex_unlock(&oom_lock);
> >       return ret;
> >  }
>
> Hi Yafang,
>
> We've run with a patch very much like this for several years and it works
> quite successfully to prevent the unnecessary oom killing of processes.
>
> We do this in out_of_memory() directly, however, because we found that we
> could prevent even *more* unnecessary killing if we checked this at the
> "point of no return" because the selection of processes takes some
> additional time when we might resolve the oom condition.
>

Hi David,

Your proposal could also resolve the issue, but I'm wondering why we do
it specifically for memcg oom.
Doesn't it apply to global oom as well?
For example, in the global oom case, while the processes are being
selected, other tasks might free some pages and the allocation might
then succeed.

> Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> generic mm code, but in the interest of preventing any unnecessary oom
> kill we've found it to be helpful.
>
> I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  void mem_cgroup_split_huge_fixup(struct page *head);
>  #endif
>
> +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> +
>  #else /* CONFIG_MEMCG */
>
>  #define MEM_CGROUP_ID_SHIFT    0
> @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  {
>  }
>
> +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> +{
> +}
> +
>  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
>                                                   bool in_low_reclaim)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
>   * Returns the maximum amount of memory @mem can be charged with, in
>   * pages.
>   */
> -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  {
>         unsigned long margin = 0;
>         unsigned long count;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
>                 if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
>                         panic("System is deadlocked on memory\n");
>         }
> -       if (oc->chosen && oc->chosen != (void *)-1UL)
> +       if (oc->chosen && oc->chosen != (void *)-1UL) {
> +               if (is_memcg_oom(oc)) {
> +                       /*
> +                        * If a memcg is now under its limit or current will be
> +                        * exiting and freeing memory, avoid needlessly killing
> +                        * chosen.
> +                        */
> +                       if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> +                           task_will_free_mem(current)) {
> +                               put_task_struct(oc->chosen);
> +                               return true;
> +                       }
> +               }
> +
>                 oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
>                                  "Memory cgroup out of memory");
> +       }
>         return !!oc->chosen;
>  }
>
David Rientjes July 15, 2020, 2:44 a.m. UTC | #5
On Wed, 15 Jul 2020, Yafang Shao wrote:

> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 1962232..15e0e18 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >               .gfp_mask = gfp_mask,
> > >               .order = order,
> > >       };
> > > -     bool ret;
> > > +     bool ret = true;
> > >
> > >       if (mutex_lock_killable(&oom_lock))
> > >               return true;
> > > +
> > > +     if (mem_cgroup_margin(memcg) >= (1 << order))
> > > +             goto unlock;
> > > +
> > >       /*
> > >        * A few threads which were not waiting at mutex_lock_killable() can
> > >        * fail to bail out. Therefore, check again after holding oom_lock.
> > >        */
> > >       ret = should_force_charge() || out_of_memory(&oc);
> > > +
> > > +unlock:
> > >       mutex_unlock(&oom_lock);
> > >       return ret;
> > >  }
> >
> > Hi Yafang,
> >
> > We've run with a patch very much like this for several years and it works
> > quite successfully to prevent the unnecessary oom killing of processes.
> >
> > We do this in out_of_memory() directly, however, because we found that we
> > could prevent even *more* unnecessary killing if we checked this at the
> > "point of no return" because the selection of processes takes some
> > additional time when we might resolve the oom condition.
> >
> 
> Hi David,
> 
> Your proposal could also resolve the issue,

It has successfully resolved the issue in our kernel for several years. We 
tried an approach similar to yours but saw many instances where memcg oom 
kills continued to proceed even though the memcg information dumped to the 
kernel log showed memory available.

If this was a page or two that became available due to memory freeing, 
it's not a significant difference.  Instead, if this races with an oom 
notification and a process exiting or being SIGKILL'd, it becomes much 
harder to explain to a user why their process was oom killed when there 
are tens of megabytes of memory available as shown by the kernel log (the 
freeing/exiting happened during a particularly long iteration of processes 
attached to the memcg, for example).

That's what motivated a change to moving this to out_of_memory() directly, 
we found that it prevented even more unnecessary oom kills, which is a 
very good thing.  It may only be easily observable and make a significant 
difference at very large scale, however.

> but I'm wondering why do
> it specifically for memcg oom?
> Doesn't it apply to global oom?
> For example, in the global oom, when selecting the processes, the
> others might free some pages and then it might allocate pages
> successfully.
> 

It's more complex because memory being allocated from the page allocator 
must be physically contiguous, it's not a simple matter of comparing the 
margin of available memory to the memory being charged.  It could 
theoretically be done but I haven't seen a use case where it has actually 
mattered as opposed to memcg oom where it can happen quite readily at 
scale.  When memory is uncharged to a memcg because of large freeing or a 
process exiting, that's immediately chargable by another process in the 
same hierarchy because of its isolation as opposed to the page allocator 
where that memory is up for grabs and anything on the system could 
allocate it.
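
(For context, the page allocator's own last minute check referenced in the
commit message looks roughly like the following; this is a simplified sketch
of __alloc_pages_may_oom() from the ~v5.8 kernel, with most details elided:)

	/* in __alloc_pages_may_oom(), simplified */
	if (!mutex_trylock(&oom_lock)) {
		/* somebody else is handling the oom, back off and retry */
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * Go through the zonelist one more time at a very high watermark;
	 * this only catches memory freed by a parallel oom kill.
	 */
	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
				      ~__GFP_DIRECT_RECLAIM, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page)
		goto out;
	/* ...otherwise fall through to out_of_memory() */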

> > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > generic mm code, but in the interest of preventing any unnecessary oom
> > kill we've found it to be helpful.
> >
> > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> >  void mem_cgroup_split_huge_fixup(struct page *head);
> >  #endif
> >
> > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > +
> >  #else /* CONFIG_MEMCG */
> >
> >  #define MEM_CGROUP_ID_SHIFT    0
> > @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> >  {
> >  }
> >
> > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > +{
> > +}
> > +
> >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> >                                                   bool in_low_reclaim)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> >   * Returns the maximum amount of memory @mem can be charged with, in
> >   * pages.
> >   */
> > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >  {
> >         unsigned long margin = 0;
> >         unsigned long count;
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> >                 if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> >                         panic("System is deadlocked on memory\n");
> >         }
> > -       if (oc->chosen && oc->chosen != (void *)-1UL)
> > +       if (oc->chosen && oc->chosen != (void *)-1UL) {
> > +               if (is_memcg_oom(oc)) {
> > +                       /*
> > +                        * If a memcg is now under its limit or current will be
> > +                        * exiting and freeing memory, avoid needlessly killing
> > +                        * chosen.
> > +                        */
> > +                       if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > +                           task_will_free_mem(current)) {
> > +                               put_task_struct(oc->chosen);
> > +                               return true;
> > +                       }
> > +               }
> > +
> >                 oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> >                                  "Memory cgroup out of memory");
> > +       }
> >         return !!oc->chosen;
> >  }
> >
> 
> 
> -- 
> Thanks
> Yafang
>
Yafang Shao July 15, 2020, 3:10 a.m. UTC | #6
On Wed, Jul 15, 2020 at 10:44 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 1962232..15e0e18 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >               .gfp_mask = gfp_mask,
> > > >               .order = order,
> > > >       };
> > > > -     bool ret;
> > > > +     bool ret = true;
> > > >
> > > >       if (mutex_lock_killable(&oom_lock))
> > > >               return true;
> > > > +
> > > > +     if (mem_cgroup_margin(memcg) >= (1 << order))
> > > > +             goto unlock;
> > > > +
> > > >       /*
> > > >        * A few threads which were not waiting at mutex_lock_killable() can
> > > >        * fail to bail out. Therefore, check again after holding oom_lock.
> > > >        */
> > > >       ret = should_force_charge() || out_of_memory(&oc);
> > > > +
> > > > +unlock:
> > > >       mutex_unlock(&oom_lock);
> > > >       return ret;
> > > >  }
> > >
> > > Hi Yafang,
> > >
> > > We've run with a patch very much like this for several years and it works
> > > quite successfully to prevent the unnecessary oom killing of processes.
> > >
> > > We do this in out_of_memory() directly, however, because we found that we
> > > could prevent even *more* unnecessary killing if we checked this at the
> > > "point of no return" because the selection of processes takes some
> > > additional time when we might resolve the oom condition.
> > >
> >
> > Hi David,
> >
> > Your proposal could also resolve the issue,
>
> It has successfully resolved it for several years in our kernel, we tried
> an approach similiar to yours but saw many instances where memcg oom kills
> continued to proceed even though the memcg information dumped to the
> kernel log showed memory available.
>
> If this was a page or two that became available due to memory freeing,
> it's not a significant difference.  Instead, if this races with an oom
> notification and a process exiting or being SIGKILL'd, it becomes much
> harder to explain to a user why their process was oom killed when there
> are tens of megabytes of memory available as shown by the kernel log (the
> freeing/exiting happened during a particularly long iteration of processes
> attached to the memcg, for example).
>
> That's what motivated a change to moving this to out_of_memory() directly,
> we found that it prevented even more unnecessary oom kills, which is a
> very good thing.  It may only be easily observable and make a significant
> difference at very large scale, however.
>

Thanks for the clarification.

If it is the race which causes this issue and we want to reduce the
race window, I'm not sure whether it is better to check the memcg
margin in out_of_memory() or to do it right before calling
do_send_sig_info(). Per my understanding, dump_header() always takes
much more time than select_bad_process(), especially if there are slow
consoles. So the race can easily happen while doing dump_header() or
dumping other information, but if we check the memcg margin after
dumping the oom info, it would be strange to dump so many oom logs
without killing a process.

> > but I'm wondering why do
> > it specifically for memcg oom?
> > Doesn't it apply to global oom?
> > For example, in the global oom, when selecting the processes, the
> > others might free some pages and then it might allocate pages
> > successfully.
> >
>
> It's more complex because memory being allocated from the page allocator
> must be physically contiguous, it's not a simple matter of comparing the
> margin of available memory to the memory being charged.  It could
> theoretically be done but I haven't seen a use case where it has actually
> mattered as opposed to memcg oom where it can happen quite readily at
> scale.  When memory is uncharged to a memcg because of large freeing or a
> process exiting, that's immediately chargable by another process in the
> same hierarchy because of its isolation as opposed to the page allocator
> where that memory is up for grabs and anything on the system could
> allocate it.
>
> > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > > generic mm code, but in the interest of preventing any unnecessary oom
> > > kill we've found it to be helpful.
> > >
> > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  void mem_cgroup_split_huge_fixup(struct page *head);
> > >  #endif
> > >
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > > +
> > >  #else /* CONFIG_MEMCG */
> > >
> > >  #define MEM_CGROUP_ID_SHIFT    0
> > > @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  {
> > >  }
> > >
> > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +{
> > > +}
> > > +
> > >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > >                                                   bool in_low_reclaim)
> > >  {
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> > >   * Returns the maximum amount of memory @mem can be charged with, in
> > >   * pages.
> > >   */
> > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > >         unsigned long margin = 0;
> > >         unsigned long count;
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> > >                 if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> > >                         panic("System is deadlocked on memory\n");
> > >         }
> > > -       if (oc->chosen && oc->chosen != (void *)-1UL)
> > > +       if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > +               if (is_memcg_oom(oc)) {
> > > +                       /*
> > > +                        * If a memcg is now under its limit or current will be
> > > +                        * exiting and freeing memory, avoid needlessly killing
> > > +                        * chosen.
> > > +                        */
> > > +                       if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > > +                           task_will_free_mem(current)) {
> > > +                               put_task_struct(oc->chosen);
> > > +                               return true;
> > > +                       }
> > > +               }
> > > +
> > >                 oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > >                                  "Memory cgroup out of memory");
> > > +       }
> > >         return !!oc->chosen;
> > >  }
> > >
> >
> >
> > --
> > Thanks
> > Yafang
> >
David Rientjes July 15, 2020, 3:18 a.m. UTC | #7
On Wed, 15 Jul 2020, Yafang Shao wrote:

> > It has successfully resolved it for several years in our kernel, we tried
> > an approach similiar to yours but saw many instances where memcg oom kills
> > continued to proceed even though the memcg information dumped to the
> > kernel log showed memory available.
> >
> > If this was a page or two that became available due to memory freeing,
> > it's not a significant difference.  Instead, if this races with an oom
> > notification and a process exiting or being SIGKILL'd, it becomes much
> > harder to explain to a user why their process was oom killed when there
> > are tens of megabytes of memory available as shown by the kernel log (the
> > freeing/exiting happened during a particularly long iteration of processes
> > attached to the memcg, for example).
> >
> > That's what motivated a change to moving this to out_of_memory() directly,
> > we found that it prevented even more unnecessary oom kills, which is a
> > very good thing.  It may only be easily observable and make a significant
> > difference at very large scale, however.
> >
> 
> Thanks for the clarification.
> 
> If it is the race which causes this issue and we want to reduce the
> race window, I don't know whether it is proper to check the memcg
> margin in out_of_memory() or  do it before calling do_send_sig_info().
> Because per my understanding, dump_header() always takes much more
> time than select_bad_process() especially if there're slow consoles.
> So the race might easily happen when doing dump_header()  or dumping
> other information, but if we check the memcg margin after dumping this
> oom info, it would be strange to dump so much oom logs without killing
> a process.
> 

Absolutely correct :)  In my proposed patch, we declare dump_header() as 
the "point of no return" since we don't want to dump oom kill information 
to the kernel log when nothing is actually killed.  We could abort at the 
very last minute, as you mention, but I think that may have an adverse 
impact on anything that cares about that log message.
Yafang Shao July 15, 2020, 3:31 a.m. UTC | #8
On Wed, Jul 15, 2020 at 11:18 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > It has successfully resolved it for several years in our kernel, we tried
> > > an approach similiar to yours but saw many instances where memcg oom kills
> > > continued to proceed even though the memcg information dumped to the
> > > kernel log showed memory available.
> > >
> > > If this was a page or two that became available due to memory freeing,
> > > it's not a significant difference.  Instead, if this races with an oom
> > > notification and a process exiting or being SIGKILL'd, it becomes much
> > > harder to explain to a user why their process was oom killed when there
> > > are tens of megabytes of memory available as shown by the kernel log (the
> > > freeing/exiting happened during a particularly long iteration of processes
> > > attached to the memcg, for example).
> > >
> > > That's what motivated a change to moving this to out_of_memory() directly,
> > > we found that it prevented even more unnecessary oom kills, which is a
> > > very good thing.  It may only be easily observable and make a significant
> > > difference at very large scale, however.
> > >
> >
> > Thanks for the clarification.
> >
> > If it is the race which causes this issue and we want to reduce the
> > race window, I don't know whether it is proper to check the memcg
> > margin in out_of_memory() or  do it before calling do_send_sig_info().
> > Because per my understanding, dump_header() always takes much more
> > time than select_bad_process() especially if there're slow consoles.
> > So the race might easily happen when doing dump_header()  or dumping
> > other information, but if we check the memcg margin after dumping this
> > oom info, it would be strange to dump so much oom logs without killing
> > a process.
> >
>
> Absolutely correct :)  In my proposed patch, we declare dump_header() as
> the "point of no return" since we don't want to dump oom kill information
> to the kernel log when nothing is actually killed.  We could abort at the
> very last minute, as you mention, but I think that may have an adverse
> impact on anything that cares about that log message.

How about storing the memcg information in oom_control when the memcg
oom is triggered, and then showing this information in dump_header()?
IOW, the OOM info would then show the memcg status when the oom occurred,
rather than the memcg status at the time the info is printed.
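
(A purely illustrative sketch of that idea; the field name and the capture
point are assumptions, not code from this thread:)

	/* hypothetical: snapshot the memcg usage when the oom is declared */
	struct oom_control {
		...
		unsigned long memcg_usage_at_oom;
	};

	/* in mem_cgroup_out_of_memory(), before out_of_memory(&oc): */
	oc.memcg_usage_at_oom = page_counter_read(&memcg->memory);

	/* dump_header() would then print oc->memcg_usage_at_oom rather than
	 * re-reading the (possibly already changed) counters */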
Michal Hocko July 15, 2020, 6:56 a.m. UTC | #9
On Wed 15-07-20 11:10:07, Yafang Shao wrote:
[...]
> If it is the race which causes this issue and we want to reduce the
> race window, I don't know whether it is proper to check the memcg
> margin in out_of_memory() or  do it before calling do_send_sig_info().
> Because per my understanding, dump_header() always takes much more
> time than select_bad_process() especially if there're slow consoles.
> So the race might easily happen when doing dump_header()  or dumping
> other information, but if we check the memcg margin after dumping this
> oom info, it would be strange to dump so much oom logs without killing
> a process.

Yes, this is my experience as well. Unless there are gazillions of tasks
the oom victim selection should be reasonably swift. It is usually the
oom report which takes the most time and this is a huge race window.

So I think it would be best to go with your patch for now. It is more in
line with the global oom flow and it is much easier to reason about.
David Rientjes July 15, 2020, 5:30 p.m. UTC | #10
On Wed, 15 Jul 2020, Yafang Shao wrote:

> > > If it is the race which causes this issue and we want to reduce the
> > > race window, I don't know whether it is proper to check the memcg
> > > margin in out_of_memory() or  do it before calling do_send_sig_info().
> > > Because per my understanding, dump_header() always takes much more
> > > time than select_bad_process() especially if there're slow consoles.
> > > So the race might easily happen when doing dump_header()  or dumping
> > > other information, but if we check the memcg margin after dumping this
> > > oom info, it would be strange to dump so much oom logs without killing
> > > a process.
> > >
> >
> > Absolutely correct :)  In my proposed patch, we declare dump_header() as
> > the "point of no return" since we don't want to dump oom kill information
> > to the kernel log when nothing is actually killed.  We could abort at the
> > very last minute, as you mention, but I think that may have an adverse
> > impact on anything that cares about that log message.
> 
> How about storing the memcg information in oom_control when the memcg
> oom is triggered, and then show this information in dump_header() ?
> IOW, the OOM info really shows the memcg status when oom occurs,
> rather than the memcg status when this info is printed.
> 

We actually do that too in our kernel but for slightly different reasons :)  
It's pretty interesting how a lot of our previous concerns with memcg oom 
killing have been echoed by you in this thread.  But yes, we store vital 
information about the memcg at the time of the first oom event when the 
oom killer is disabled (to allow userspace to determine what the best 
course of action is).

But regardless of whether we present previous data to the user in the 
kernel log or not, we've determined that oom killing a process is a 
serious matter and go to any lengths possible to avoid having to do it.  
For us, that means waiting until the "point of no return" to either go 
ahead with oom killing a process or aborting and retrying the charge.

I don't think moving the mem_cgroup_margin() check to out_of_memory() 
right before printing the oom info and killing the process is a very 
invasive patch.  Any strong preference against doing it that way?  I think 
moving the check as late as possible to save a process from being killed 
when racing with an exiter or killed process (including perhaps current) 
has a pretty clear motivation.
Yafang Shao July 16, 2020, 2:38 a.m. UTC | #11
On Thu, Jul 16, 2020 at 1:30 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > > If it is the race which causes this issue and we want to reduce the
> > > > race window, I don't know whether it is proper to check the memcg
> > > > margin in out_of_memory() or  do it before calling do_send_sig_info().
> > > > Because per my understanding, dump_header() always takes much more
> > > > time than select_bad_process() especially if there're slow consoles.
> > > > So the race might easily happen when doing dump_header()  or dumping
> > > > other information, but if we check the memcg margin after dumping this
> > > > oom info, it would be strange to dump so much oom logs without killing
> > > > a process.
> > > >
> > >
> > > Absolutely correct :)  In my proposed patch, we declare dump_header() as
> > > the "point of no return" since we don't want to dump oom kill information
> > > to the kernel log when nothing is actually killed.  We could abort at the
> > > very last minute, as you mention, but I think that may have an adverse
> > > impact on anything that cares about that log message.
> >
> > How about storing the memcg information in oom_control when the memcg
> > oom is triggered, and then show this information in dump_header() ?
> > IOW, the OOM info really shows the memcg status when oom occurs,
> > rather than the memcg status when this info is printed.
> >
>
> We actually do that too in our kernel but for slightly other reasons :)
> It's pretty interesting how a lot of our previous concerns with memcg oom
> killing have been echoed by you in this thread.

These should be common concerns of container users :)
I'm a heavy container user for now.

>  But yes, we store vital
> information about the memcg at the time of the first oom event when the
> oom killer is disabled (to allow userspace to determine what the best
> course of action is).
>

It would be better if you could upstream the features in your kernel,
and I think it could also help the other users.

> But regardless of whether we present previous data to the user in the
> kernel log or not, we've determined that oom killing a process is a
> serious matter and go to any lengths possible to avoid having to do it.
> For us, that means waiting until the "point of no return" to either go
> ahead with oom killing a process or aborting and retrying the charge.
>
> I don't think moving the mem_cgroup_margin() check to out_of_memory()
> right before printing the oom info and killing the process is a very
> invasive patch.  Any strong preference against doing it that way?  I think
> moving the check as late as possible to save a process from being killed
> when racing with an exiter or killed process (including perhaps current)
> has a pretty clear motivation.

I understand what you mean by "point of no return", but that seems like
a workaround rather than a fix.
If you don't want to kill processes unnecessarily, then checking the
memcg margin right before sending SIGKILL is better, because as I said
before the race will most likely happen in dump_header().
If you don't want to show strange OOM information like "your process
was oom killed and it shows usage is 60MB in a memcg limited
to 100MB", it is better to take a snapshot of the OOM state when it is
triggered and show it later, and I think that could also apply to
the global OOM.

My patch, meanwhile, is meant to fix the issue caused by parallel OOM,
where the other tasks are waiting on the oom_lock while one process is
handling the OOM. And as Michal explained before, it is more in line
with the global oom flow and much easier to reason about.
Tetsuo Handa July 16, 2020, 5:54 a.m. UTC | #12
On 2020/07/16 2:30, David Rientjes wrote:
> But regardless of whether we present previous data to the user in the 
> kernel log or not, we've determined that oom killing a process is a 
> serious matter and go to any lengths possible to avoid having to do it.  
> For us, that means waiting until the "point of no return" to either go 
> ahead with oom killing a process or aborting and retrying the charge.
> 
> I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> right before printing the oom info and killing the process is a very 
> invasive patch.  Any strong preference against doing it that way?  I think 
> moving the check as late as possible to save a process from being killed 
> when racing with an exiter or killed process (including perhaps current) 
> has a pretty clear motivation.
> 

How about ignoring MMF_OOM_SKIP once? I think this has almost the same
effect as moving the mem_cgroup_margin() check to out_of_memory()
right before printing the oom info and killing the process.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 48e0db54d838..88170af3b9eb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -322,7 +322,8 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags) &&
+		    !test_and_clear_bit(MMF_OOM_REAP_QUEUED, &task->signal->oom_mm->flags))
 			goto next;
 		goto abort;
 	}
@@ -658,7 +659,8 @@ static int oom_reaper(void *unused)
 static void wake_oom_reaper(struct task_struct *tsk)
 {
 	/* mm is already queued? */
-	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+	if (test_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags) ||
+	    test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
Michal Hocko July 16, 2020, 6:08 a.m. UTC | #13
On Wed 15-07-20 10:30:51, David Rientjes wrote:
[...]
> I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> right before printing the oom info and killing the process is a very 
> invasive patch.  Any strong preference against doing it that way?  I think 
> moving the check as late as possible to save a process from being killed 
> when racing with an exiter or killed process (including perhaps current) 
> has a pretty clear motivation.

We have been through this discussion several times in the past IIRC.
The conclusion has been that the allocator (the charging path for
the memcg) is the one to define the OOM situation. This is an inherently
racy situation as long as we are not synchronizing oom with the world,
which I believe we agree we do not want to do. There are a few exceptions
that bail out early from the oom under certain situations, and the trend
has been to remove some of the existing ones rather than add new ones
because they had subtle side effects and were prone to lockups.

As much as it might sound attractive to move mem_cgroup_margin resp. the
last allocation attempt closer to the actual oom killing, I haven't seen
any convincing data to support that such a change would make a
big difference. select_bad_process is not a free operation as it scales
with the number of tasks in the oom domain, but it shouldn't be super
expensive. The oom reporting is by far the most expensive part of the
operation.

That being said, really convincing data should be presented in order
to do such a change. I do not think we want to do that just in case.
Michal Hocko July 16, 2020, 6:11 a.m. UTC | #14
On Thu 16-07-20 14:54:01, Tetsuo Handa wrote:
> On 2020/07/16 2:30, David Rientjes wrote:
> > But regardless of whether we present previous data to the user in the 
> > kernel log or not, we've determined that oom killing a process is a 
> > serious matter and go to any lengths possible to avoid having to do it.  
> > For us, that means waiting until the "point of no return" to either go 
> > ahead with oom killing a process or aborting and retrying the charge.
> > 
> > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > right before printing the oom info and killing the process is a very 
> > invasive patch.  Any strong preference against doing it that way?  I think 
> > moving the check as late as possible to save a process from being killed 
> > when racing with an exiter or killed process (including perhaps current) 
> > has a pretty clear motivation.
> > 
> 
> How about ignoring MMF_OOM_SKIP for once? I think this has almost same
> effect as moving the mem_cgroup_margin() check to out_of_memory() 
> right before printing the oom info and killing the process.

How would that help with races when a task is exiting while the oom
selects a victim? We are not talking about races with the oom_reaper
IIUC. Btw. if races with the oom_reaper are a concern then I would much
rather delay the wake up than complicate the existing protocol even
further.
David Rientjes July 16, 2020, 6:56 a.m. UTC | #15
On Thu, 16 Jul 2020, Michal Hocko wrote:

> > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > right before printing the oom info and killing the process is a very 
> > invasive patch.  Any strong preference against doing it that way?  I think 
> > moving the check as late as possible to save a process from being killed 
> > when racing with an exiter or killed process (including perhaps current) 
> > has a pretty clear motivation.
> 
> We have been through this discussion several times in the past IIRC
> The conclusion has been that the allocator (charging path for
> the memcg) is the one to define OOM situation. This is an inherently
> racy situation as long as we are not synchronizing oom with the world,
> which I believe we agree, we do not want to do. There are few exceptions
> to bail out early from the oom under certain situations and the trend
> was to remove some of the existing ones rather than adding new because
> they had subtle side effects and were prone to lockups.
> 
> As much as it might sound attractive to move mem_cgroup_margin resp.
> last allocation attempt closer to the actual oom killing I haven't seen
> any convincing data that would support that such a change would make a
> big difference. select_bad_process is not a free operation as it scales
> with the number of tasks in the oom domain but it shouldn't be a super
> expensive. The oom reporting is by far the most expensive part of the
> operation.
> 
> That being said, really convincing data should be presented in order
> to do such a change. I do not think we want to do that just in case.

It's not possible to present data because we've had such a check for years 
in our fleet so I can't say that it has prevented X unnecessary oom kills 
compared to doing the check prior to calling out_of_memory().  I'm hoping 
that can be understood.

Since Yafang is facing the same issue, and there is no significant 
downside to doing the mem_cgroup_margin() check prior to 
oom_kill_process() (or checking task_will_free_mem(current)), and it's 
acknowledged that it *can* prevent unnecessary oom killing, which is a 
very good thing, I'd like to understand why such resistance to it.

Killing a user process is a serious matter.  I would fully agree if the 
margin is only one page: it's still better to kill something off.  But 
when memory has been uncharged by means induced by a process waiting 
on an oom notification, such as a userspace kill or the malloc 
implementation dropping its caches, that uncharge can be quite 
substantial and oom killing is then unnecessary.

I can refresh the patch and send it formally.
David Rientjes July 16, 2020, 7:04 a.m. UTC | #16
On Thu, 16 Jul 2020, Yafang Shao wrote:

> >  But yes, we store vital
> > information about the memcg at the time of the first oom event when the
> > oom killer is disabled (to allow userspace to determine what the best
> > course of action is).
> >
> 
> It would be better if you could upstream the features in your kernel,
> and I think it could also help the other users.
> 

Everything we've discussed so far has been proposed in the past, actually.  
I think we stress the oom killer and use it at a scale that others do not, 
so only a subset of users would find it interesting.  You are very likely 
part of that subset.

We should certainly talk about other issues that we have run into that 
make the upstream oom killer unusable.  Are there other areas that you're 
currently focused on or having trouble with?  I'd be happy to have a 
discussion on how we have resolved a lot of its issues.

> I understand what you mean "point of no return", but that seems a
> workaround rather than a fix.
> If you don't want to kill unnecessary processes, then checking the
> memcg margin before sending sigkill is better, because as I said
> before the race will be most likely happening in dump_header().
> If you don't want to show strange OOM information like "your process
> was oom killed and it shows usage is 60MB in a memcg limited
> to 100MB", it is better to get the snapshot of the OOM when it is
> triggered and then show it later, and I think it could also apply to
> the global OOM.
> 

It's likely a misunderstanding: I wasn't necessarily concerned about 
showing 60MB in a memcg limited to 100MB; that part we can deal with. The 
concern was that, after dumping all of that great information, instead of 
getting a "Killed process..." line we get an "Oh, there's memory now, just 
kidding about everything we just dumped" ;)

We could likely enlighten userspace about that so that we don't consider 
that to be an actual oom kill.  But I also very much agree that after 
dump_header() would be appropriate as well since the goal is to prevent 
unnecessary oom killing.

Would you mind sending a patch to check mem_cgroup_margin() on 
is_memcg_oom() prior to sending the SIGKILL to the victim and printing the 
"Killed process..." line?  We'd need a line that says "xKB of memory now 
available -- suppressing oom kill" or something along those lines so 
userspace understands what happened.  But the memory info that it emits 
both for the state of the memcg and system RAM may also be helpful to 
understand why we got to the oom kill in the first place, which is also a 
good thing.

I'd happily ack that patch since it would be a comprehensive solution that 
avoids oom kill of user processes at all costs, which is a goal I think we 
can all rally behind.
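
(A minimal sketch of what such a check might look like; the exact placement
inside oom_kill_process() and the wording of the message are assumptions, not
code from this thread:)

	/* in oom_kill_process(), after dump_header() but before the kill */
	if (is_memcg_oom(oc) &&
	    mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
		/* enough charge was released in the meantime; do not kill */
		pr_info("%lukB of memory now available -- suppressing oom kill\n",
			K(mem_cgroup_margin(oc->memcg)));
		put_task_struct(oc->chosen);
		return;
	}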
David Rientjes July 16, 2020, 7:06 a.m. UTC | #17
On Thu, 16 Jul 2020, Michal Hocko wrote:

> > > But regardless of whether we present previous data to the user in the 
> > > kernel log or not, we've determined that oom killing a process is a 
> > > serious matter and go to any lengths possible to avoid having to do it.  
> > > For us, that means waiting until the "point of no return" to either go 
> > > ahead with oom killing a process or aborting and retrying the charge.
> > > 
> > > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > > right before printing the oom info and killing the process is a very 
> > > invasive patch.  Any strong preference against doing it that way?  I think 
> > > moving the check as late as possible to save a process from being killed 
> > > when racing with an exiter or killed process (including perhaps current) 
> > > has a pretty clear motivation.
> > > 
> > 
> > How about ignoring MMF_OOM_SKIP for once? I think this has almost same
> > effect as moving the mem_cgroup_margin() check to out_of_memory() 
> > right before printing the oom info and killing the process.
> 
> How would that help with races when a task is exiting while the oom
> selects a victim? We are not talking about races with the oom_reaper
> IIUC. Btw. if races with the oom_reaper are a concern then I would much
> rather delay the wake up than complicate the existing protocol even
> further.

Right, this isn't a concern about racing with oom reaping or when finding 
processes that have already been selected as the oom victim.  This is 
about (potentially significant) amounts of memory that have been uncharged 
to the memcg hierarchy between the failure of reclaim to uncharge memory 
and the actual killing of a user process.
Michal Hocko July 16, 2020, 7:12 a.m. UTC | #18
On Wed 15-07-20 23:56:11, David Rientjes wrote:
> On Thu, 16 Jul 2020, Michal Hocko wrote:
> 
> > > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > > right before printing the oom info and killing the process is a very 
> > > invasive patch.  Any strong preference against doing it that way?  I think 
> > > moving the check as late as possible to save a process from being killed 
> > > when racing with an exiter or killed process (including perhaps current) 
> > > has a pretty clear motivation.
> > 
> > We have been through this discussion several times in the past IIRC
> > The conclusion has been that the allocator (charging path for
> > the memcg) is the one to define OOM situation. This is an inherently
> > racy situation as long as we are not synchronizing oom with the world,
> > which I believe we agree, we do not want to do. There are few exceptions
> > to bail out early from the oom under certain situations and the trend
> > was to remove some of the existing ones rather than adding new because
> > they had subtle side effects and were prone to lockups.
> > 
> > As much as it might sound attractive to move mem_cgroup_margin resp.
> > last allocation attempt closer to the actual oom killing I haven't seen
> > any convincing data that would support that such a change would make a
> > big difference. select_bad_process is not a free operation as it scales
> > with the number of tasks in the oom domain but it shouldn't be a super
> > expensive. The oom reporting is by far the most expensive part of the
> > operation.
> > 
> > That being said, really convincing data should be presented in order
> > to do such a change. I do not think we want to do that just in case.
> 
> It's not possible to present data because we've had such a check for years 
> in our fleet so I can't say that it has prevented X unnecessary oom kills 
> compared to doing the check prior to calling out_of_memory().  I'm hoping 
> that can be understood.
> 
> Since Yafang is facing the same issue, and there is no significant 
> downside to doing the mem_cgroup_margin() check prior to 
> oom_kill_process() (or checking task_will_free_mem(current)), and it's 
> acknowledged that it *can* prevent unnecessary oom killing, which is a 
> very good thing, I'd like to understand why such resistance to it.

Because exactly this kind of argument has led to quite a few "should be
fine" heuristics which kicked back: do not kill an exiting task, sacrifice
a child instead of the victim, just to name a few. All of them make some
sense at a glance but they can seriously kick back, as experience has
taught us.

Really, I do not see what is so hard to understand about the fact that
each heuristic, especially in a subtle area like oom handling, needs data
to justify it. "We have been running this for years" is really not an
argument. Sure, arguing that your workload leads to x amount of false
positives and that shifting the check later saves y amount of them sounds
like a relevant argument to me.

> Killing a user process is a serious matter.

It definitely is and I believe that nobody is questioning that. The oom
situation is a serious matter on its own. It says that the system has
failed to balance the memory consumption and the oom killer is the
_last_ resort action to be taken.

> I would fully agree if the 
> margin is only one page: it's still better to kill something off.

And exactly these kinds of details make any heuristic subtle and hard
to maintain.

> But 
> when a process has uncharged memory by means induced by a process waiting 
> on oom notication, such as a userspace kill or dropping of caches from 
> your malloc implementation, that uncharge can be quite substantial and oom 
> killing is then unnecessary.

That should happen quite some time before the hard limit is reached, and
we have means to achieve that. The high limit is there to help with
pro-active reclaim before the OOM happens on the hard limit. If you are
stuck with v1, then disabling the oom killer and shifting the whole logic
into userspace is another option.
 
> I can refresh the patch and send it formally.
Yafang Shao July 16, 2020, 11:53 a.m. UTC | #19
On Thu, Jul 16, 2020 at 3:04 PM David Rientjes <rientjes@google.com> wrote:
>
> On Thu, 16 Jul 2020, Yafang Shao wrote:
>
> > >  But yes, we store vital
> > > information about the memcg at the time of the first oom event when the
> > > oom killer is disabled (to allow userspace to determine what the best
> > > course of action is).
> > >
> >
> > It would be better if you could upstream the features in your kernel,
> > and I think it could also help the other users.
> >
>
> Everything we've discussed so far has been proposed in the past, actually.
> I think we stress the oom killer and use it at scale that others do not,
> so only a subset of users would find it interesting.  You are very likely
> one of those subset of users.
>
> We should certainly talk about other issues that we have run into that
> make the upstream oom killer unusable.  Are there other areas that you're
> currently focused on or having trouble with?  I'd be happy to have a
> discussion on how we have resolved a lot of its issues.
>
> > I understand what you mean by "point of no return", but that seems a
> > workaround rather than a fix.
> > If you don't want to kill processes unnecessarily, then checking the
> > memcg margin before sending SIGKILL is better, because as I said
> > before the race will most likely happen during dump_header().
> > If you don't want to show strange OOM information like "your process
> > was oom killed, yet it shows usage of 60MB in a memcg limited
> > to 100MB", it is better to take a snapshot of the OOM state when it is
> > triggered and then show it later, and I think that could also apply to
> > the global OOM.
> >
>
> It's likely a misunderstanding: I wasn't necessarily concerned about
> showing 60MB in a memcg limited to 100MB, that part we can deal with, the
> concern was after dumping all of that great information that instead of
> getting a "Killed process..." we get a "Oh, there's memory now, just
> kidding about everything we just dumped" ;)
>

Actually the kernel is doing it now, see below,

dump_header() <<<< dump lots of information
__oom_kill_process
    p = find_lock_task_mm(victim);
    if (!p)
       return;   <<<< without killing any process.
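
For reference, this is roughly how that path looks to me in the current
code (simplified, so the exact lines may differ between versions):

/* oom_kill_process() has already called dump_header() at this point */
static void __oom_kill_process(struct task_struct *victim, const char *message)
{
    struct task_struct *p;

    p = find_lock_task_mm(victim);
    if (!p) {
        /*
         * The victim has already passed exit_mm(): drop the reference
         * and bail out.  Nothing is killed and the "Killed process ..."
         * line is never printed, even though the full oom report was
         * already dumped.
         */
        put_task_struct(victim);
        return;
    }
    /* ... otherwise proceed to SIGKILL the victim ... */
}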


> We could likely enlighten userspace about that so that we don't consider
> that to be an actual oom kill.  But I also very much agree that after
> dump_header() would be appropriate as well since the goal is to prevent
> unnecessary oom killing.
>
> Would you mind sending a patch to check mem_cgroup_margin() on
> is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> "Killed process..." line?  We'd need a line that says "xKB of memory now
> available -- suppressing oom kill" or something along those lines so
> userspace understands what happened.  But the memory info that it emits
> both for the state of the memcg and system RAM may also be helpful to
> understand why we got to the oom kill in the first place, which is also a
> good thing.
>
> I'd happily ack that patch since it would be a comprehensive solution that
> avoids oom kill of user processes at all costs, which is a goal I think we
> can all rally behind.

I'd prefer to put dump_header() behind do_send_sig_info(), for example,

__oom_kill_process()
    do_send_sig_info()
    dump_header() <<<< it may be better to put it after wake_oom_reaper(), but
it may lose some information to dump...
    pr_err("%s: Killed process %d (%s)....")

Because the main goal of the OOM killer is to kill a process and free
pages ASAP to avoid a system or memcg stall.
We have all seen that dump_header() may take a long time to finish,
especially if there is a slow console, and this delay may cause a
severe system stall, so we'd better defer the dump.

But that should be another topic.
Michal Hocko July 16, 2020, 12:21 p.m. UTC | #20
On Thu 16-07-20 19:53:12, Yafang Shao wrote:
[...]
> I'd prefer to put dump_header() behind do_send_sig_info(), for example,
> 
> __oom_kill_process()
>     do_send_sig_info()
>     dump_header() <<<< it may be better to put it after wake_oom_reaper(), but
> it may lose some information to dump...
>     pr_err("%s: Killed process %d (%s)....")
> 
> Because the main goal of the OOM killer is to kill a process and free
> pages ASAP to avoid a system or memcg stall.
> We have all seen that dump_header() may take a long time to finish,
> especially if there is a slow console, and this delay may cause a
> severe system stall, so we'd better defer the dump.

I would be worried that such a memory dump would be very likely not
reflecting the actual OOM situation and thus hard to use for analysis.
Why? Because the killed oom victim can _usually_ die or the oom reaper
can free up a lot of memory very quickly.

If we look at it, the main part of the report which takes long to dump is
dump_tasks, because this is potentially a lot of data. I haven't seen
show_mem or its memcg counterpart take a long time. We've discussed
how useful dump_tasks really is for production systems and the
general feedback was that it should be enabled by default, though.

If the oom report is going to be misleading then it would be better not
to bother at all IMHO. From my experience the memory state is really useful
when debugging ooms and having a reasonable snapshot is really
important. IIRC Tetsuo was suggesting collecting all the information at
the time of the oom and printing that snapshot later on, but the patch was
quite invasive and didn't handle multiple OOMs from different oom
domains - especially the dump_tasks part, which would need to link all
the tasks for different oom contexts.

Anyway, I think we are getting off topic here.
Tetsuo Handa July 16, 2020, 1:09 p.m. UTC | #21
On 2020/07/16 21:21, Michal Hocko wrote:
> If the oom report is going to be misleading then it would be better not
> to bother at all IMHO. From my experience the memory state is really useful
> when debugging ooms and having a reasonable snapshot is really
> important. IIRC Tetsuo was suggesting collecting all the information at
> the time of the oom and printing that snapshot later on, but the patch was
> quite invasive and didn't handle multiple OOMs from different oom
> domains - especially the dump_tasks part, which would need to link all
> the tasks for different oom contexts.

I did, and you just refused as always.

https://lore.kernel.org/linux-mm/20190830103504.GA28313@dhcp22.suse.cz/
https://lore.kernel.org/linux-mm/20190909113627.GJ27159@dhcp22.suse.cz/
David Rientjes July 16, 2020, 7:53 p.m. UTC | #22
On Thu, 16 Jul 2020, Yafang Shao wrote:

> > It's likely a misunderstanding: I wasn't necessarily concerned about
> > showing 60MB in a memcg limited to 100MB, that part we can deal with, the
> > concern was after dumping all of that great information that instead of
> > getting a "Killed process..." we get a "Oh, there's memory now, just
> > kidding about everything we just dumped" ;)
> >
> 
> Actually the kernel is doing it now, see below,
> 
> dump_header() <<<< dump lots of information
> __oom_kill_process
>     p = find_lock_task_mm(victim);
>     if (!p)
>        return;   <<<< without killing any process.
> 

Ah, this is catching an instance where the chosen process has already done 
exit_mm(), good catch -- I can find examples of this by scraping kernel 
logs from our fleet.

So it appears there is precedent for dumping all the oom info but not 
actually performing any action for it and I made the earlier point that 
diagnostic information in the kernel log here is still useful.  I think it 
is still preferable that the kernel at least tell us why it didn't do 
anything, but as you mention that already happens today.

Would you like to send a patch that checks for mem_cgroup_margin() here as 
well?  A second patch could make the possible inaction more visible, 
something like "Process ${pid} (${comm}) is already exiting" for the above 
check or "Memcg ${memcg} is no longer out of memory".

Another thing that these messages indicate, beyond telling us why the oom 
killer didn't actually SIGKILL anything, is that we can expect some skew 
in the memory stats that shows an availability of memory.
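
To make that concrete, here is a rough sketch of the shape I have in mind.
It is purely illustrative: it assumes mem_cgroup_margin() is made visible
to the oom killer and that the oom_control is passed down to
__oom_kill_process(), neither of which is the case today, and the exact
message wording is only a placeholder.

static void __oom_kill_process(struct oom_control *oc,
                               struct task_struct *victim,
                               const char *message)
{
    struct task_struct *p;

    p = find_lock_task_mm(victim);
    if (!p) {
        /* patch 2: make the existing inaction visible */
        pr_info("Process %d (%s) is already exiting\n",
                task_pid_nr(victim), victim->comm);
        put_task_struct(victim);
        return;
    }
    task_unlock(p);

    /*
     * patch 1: last-minute check after dump_header() but before the
     * SIGKILL, in case memory was uncharged while we were selecting
     * the victim and dumping the oom report.
     */
    if (is_memcg_oom(oc) &&
        mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
        pr_info("Memcg is no longer out of memory -- suppressing oom kill\n");
        put_task_struct(victim);
        return;
    }

    do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
    /* ... the "Killed process ..." line and the rest unchanged ... */
}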

 > 
> > We could likely enlighten userspace about that so that we don't consider
> > that to be an actual oom kill.  But I also very much agree that after
> > dump_header() would be appropriate as well since the goal is to prevent
> > unnecessary oom killing.
> >
> > Would you mind sending a patch to check mem_cgroup_margin() on
> > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> > "Killed process..." line?  We'd need a line that says "xKB of memory now
> > available -- suppressing oom kill" or something along those lines so
> > userspace understands what happened.  But the memory info that it emits
> > both for the state of the memcg and system RAM may also be helpful to
> > understand why we got to the oom kill in the first place, which is also a
> > good thing.
> >
> > I'd happily ack that patch since it would be a comprehensive solution that
> > avoids oom kill of user processes at all costs, which is a goal I think we
> > can all rally behind.
> 
> I'd prefer to put dump_header() behind do_send_sig_info(), for example,
> 
> __oom_kill_process()
>     do_send_sig_info()
>     dump_header() <<<< may better put it behind wake_oom_reaper(), but
> it may loses some information to dump...
>     pr_err("%s: Killed process %d (%s)....")
> 

I agree with Michal here that dump_header() after the actual kill would no 
longer represent the state of the system (or cpuset or memcg, depending on 
context) at the time of the oom kill so it's best to dump relevant 
information before the actual kill.
David Rientjes July 16, 2020, 8:04 p.m. UTC | #23
On Thu, 16 Jul 2020, Michal Hocko wrote:

> > It's not possible to present data because we've had such a check for years 
> > in our fleet so I can't say that it has prevented X unnecessary oom kills 
> > compared to doing the check prior to calling out_of_memory().  I'm hoping 
> > that can be understood.
> > 
> > Since Yafang is facing the same issue, and there is no significant 
> > downside to doing the mem_cgroup_margin() check prior to 
> > oom_kill_process() (or checking task_will_free_mem(current)), and it's 
> > acknowledged that it *can* prevent unnecessary oom killing, which is a 
> > very good thing, I'd like to understand why such resistance to it.
> 
> Because exactly this kind of argument has led to quite some "should be
> fine" heuristics which kicked back: do not kill an exiting task, sacrifice
> a child instead of the victim, just to name a few. All of them make some
> sense at a glance but they can seriously kick back, as experience has
> taught us.
> 
> Really, I do not see what is so hard to understand: each heuristic,
> especially in an area as subtle as oom handling, needs data to justify
> it. "We have been running this for years" is really not an argument.
> Sure, arguing that your workload leads to x amount of false positives
> and that shifting the check later saves y amount of them sounds like
> a relevant argument to me.
> 

Deferring the go/no-go decision on the oom kill to the very last moment 
doesn't seem like a heuristic; I think it's an inherent responsibility of 
the kernel to do whatever is necessary to prevent a userspace process from 
being oom killed (and it is the way to solve Yafang's issue, which we 
solved years ago).  That can be done by closing the window as much as 
possible (including within out_of_memory()) to reduce the likelihood of 
unnecessary oom killing.  It's intuitive and seems rather trivial.

I would understand an argument against such an approach if it added 
elaborate complexity, but this isn't doing so.  If the decision was 
already made in oom_kill_process(), I don't think anybody would advocate 
for moving it below out_of_memory() to its current state.  We aren't 
losing anything here, we are only preventing unnecessary oom killing that 
has caused issues for Yafang as well as us.

Any solution that does a mem_cgroup_margin() check before out_of_memory() 
in the memcg path is closing that window a little bit, but I think we can 
do better.
Yafang Shao July 17, 2020, 1:35 a.m. UTC | #24
On Fri, Jul 17, 2020 at 3:53 AM David Rientjes <rientjes@google.com> wrote:
>
> On Thu, 16 Jul 2020, Yafang Shao wrote:
>
> > > It's likely a misunderstanding: I wasn't necessarily concerned about
> > > showing 60MB in a memcg limited to 100MB, that part we can deal with, the
> > > concern was after dumping all of that great information that instead of
> > > getting a "Killed process..." we get a "Oh, there's memory now, just
> > > kidding about everything we just dumped" ;)
> > >
> >
> > Actually the kernel is doing it now, see below,
> >
> > dump_header() <<<< dump lots of information
> > __oom_kill_process
> >     p = find_lock_task_mm(victim);
> >     if (!p)
> >        return;   <<<< without killing any process.
> >
>
> Ah, this is catching an instance where the chosen process has already done
> exit_mm(), good catch -- I can find examples of this by scraping kernel
> logs from our fleet.
>
> So it appears there is precedent for dumping all the oom info but not
> actually performing any action for it and I made the earlier point that
> diagnostic information in the kernel log here is still useful.  I think it
> is still preferable that the kernel at least tell us why it didn't do
> anything, but as you mention that already happens today.
>
> Would you like to send a patch that checks for mem_cgroup_margin() here as
> well?  A second patch could make the possible inaction more visible,
> something like "Process ${pid} (${comm}) is already exiting" for the above
> check or "Memcg ${memcg} is no longer out of memory".
>
> Another thing that these messages indicate, beyond telling us why the oom
> killer didn't actually SIGKILL anything, is that we can expect some skew
> in the memory stats that shows an availability of memory.
>

Agreed, these messages would be helpful.
I will send a patch for it.

>  >
> > > We could likely enlighten userspace about that so that we don't consider
> > > that to be an actual oom kill.  But I also very much agree that after
> > > dump_header() would be appropriate as well since the goal is to prevent
> > > unnecessary oom killing.
> > >
> > > Would you mind sending a patch to check mem_cgroup_margin() on
> > > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> > > "Killed process..." line?  We'd need a line that says "xKB of memory now
> > > available -- suppressing oom kill" or something along those lines so
> > > userspace understands what happened.  But the memory info that it emits
> > > both for the state of the memcg and system RAM may also be helpful to
> > > understand why we got to the oom kill in the first place, which is also a
> > > good thing.
> > >
> > > I'd happily ack that patch since it would be a comprehensive solution that
> > > avoids oom kill of user processes at all costs, which is a goal I think we
> > > can all rally behind.
> >
> > I'd prefer to put dump_header() behind do_send_sig_info(), for example,
> >
> > __oom_kill_process()
> >     do_send_sig_info()
> >     dump_header() <<<< it may be better to put it after wake_oom_reaper(), but
> > it may lose some information to dump...
> >     pr_err("%s: Killed process %d (%s)....")
> >
>
> I agree with Michal here that dump_header() after the actual kill would no
> longer represent the state of the system (or cpuset or memcg, depending on
> context) at the time of the oom kill so it's best to dump relevant
> information before the actual kill.
David Rientjes July 17, 2020, 7:26 p.m. UTC | #25
On Fri, 17 Jul 2020, Yafang Shao wrote:

> > > Actually the kernel is doing it now, see below,
> > >
> > > dump_header() <<<< dump lots of information
> > > __oom_kill_process
> > >     p = find_lock_task_mm(victim);
> > >     if (!p)
> > >        return;   <<<< without killing any process.
> > >
> >
> > Ah, this is catching an instance where the chosen process has already done
> > exit_mm(), good catch -- I can find examples of this by scraping kernel
> > logs from our fleet.
> >
> > So it appears there is precedent for dumping all the oom info but not
> > actually performing any action for it and I made the earlier point that
> > diagnostic information in the kernel log here is still useful.  I think it
> > is still preferable that the kernel at least tell us why it didn't do
> > anything, but as you mention that already happens today.
> >
> > Would you like to send a patch that checks for mem_cgroup_margin() here as
> > well?  A second patch could make the possible inaction more visible,
> > something like "Process ${pid} (${comm}) is already exiting" for the above
> > check or "Memcg ${memcg} is no longer out of memory".
> >
> > Another thing that these messages indicate, beyond telling us why the oom
> > killer didn't actually SIGKILL anything, is that we can expect some skew
> > in the memory stats that shows an availability of memory.
> >
> 
> Agreed, these messages would be helpful.
> I will send a patch for it.
> 

Thanks Yafang.  We should also continue talking about challenges you 
encounter with the oom killer either at the system level or for memcg 
limit ooms in a separate thread.  It's clear that you are meeting several 
of the issues that we have previously seen ourselves.

I could do a full audit of all our oom killer changes that may be
interesting to you, but off the top of my head:

 - A means of triggering a memcg oom through the kernel: think of sysrq+f
   but scoped to processes attached to a memcg hierarchy.  This allows
   userspace to reliably oom kill processes on overcommitted systems
   (SIGKILL can be insufficient if we depend on oom reaping, for example,
   to make forward progress)

 - Storing the state of a memcg's memory at the time reclaim has failed
   and we must oom kill: when the memcg oom killer is disabled so that
   userspace can handle it, if it triggers an oom kill through the kernel
   because it prefers an oom kill on an overcommitted system, we need to
   dump the state of the memory at oom rather than with the stack of the
   explicit trigger
   
 - Supplement memcg oom notification with an additional notification event
   on kernel oom kill: allows users to register for an event that triggers
   when the kernel oom killer kills something (and keeps a count of these
   events available for read)
   
 - Add a notion of an oom delay: on overcommitted systems, userspace may
   become unreliable or unresponsive despite our best efforts, this
   supplements the ability to disable the oom killer for a memcg hierarchy
   with the ability to disable it for a set period of time until the oom
   killer intervenes and kills something (last ditch effort).
   
I'd be happy to discuss any of these topics if you are interested.
Yafang Shao July 18, 2020, 2:15 a.m. UTC | #26
On Sat, Jul 18, 2020 at 3:26 AM David Rientjes <rientjes@google.com> wrote:
>
> On Fri, 17 Jul 2020, Yafang Shao wrote:
>
> > > > Actually the kernel is doing it now, see below,
> > > >
> > > > dump_header() <<<< dump lots of information
> > > > __oom_kill_process
> > > >     p = find_lock_task_mm(victim);
> > > >     if (!p)
> > > >        return;   <<<< without killing any process.
> > > >
> > >
> > > Ah, this is catching an instance where the chosen process has already done
> > > exit_mm(), good catch -- I can find examples of this by scraping kernel
> > > logs from our fleet.
> > >
> > > So it appears there is precedent for dumping all the oom info but not
> > > actually performing any action for it and I made the earlier point that
> > > diagnostic information in the kernel log here is still useful.  I think it
> > > is still preferable that the kernel at least tell us why it didn't do
> > > anything, but as you mention that already happens today.
> > >
> > > Would you like to send a patch that checks for mem_cgroup_margin() here as
> > > well?  A second patch could make the possible inaction more visible,
> > > something like "Process ${pid} (${comm}) is already exiting" for the above
> > > check or "Memcg ${memcg} is no longer out of memory".
> > >
> > > Another thing that these messages indicate, beyond telling us why the oom
> > > killer didn't actually SIGKILL anything, is that we can expect some skew
> > > in the memory stats that shows an availability of memory.
> > >
> >
> > Agreed, these messages would be helpful.
> > I will send a patch for it.
> >
>
> Thanks Yafang.  We should also continue talking about challenges you
> encounter with the oom killer either at the system level or for memcg
> limit ooms in a separate thread.  It's clear that you are meeting several
> of the issues that we have previously seen ourselves.
>
> I could do a full audit of all our oom killer changes that may be
> interesting to you, but off the top of my head:
>
>  - A means of triggering a memcg oom through the kernel: think of sysrq+f
>    but scoped to processes attached to a memcg hierarchy.  This allows
>    userspace to reliably oom kill processes on overcommitted systems
>    (SIGKILL can be insufficient if we depend on oom reaping, for example,
>    to make forward progress)
>

memcg sysrq+f would be helpful.
But I'm wondering: how about waking up the oom_reaper whenever we send
SIGKILL to a process?

As for the three proposals below, I think they would be helpful as well
and I have no objections.

>  - Storing the state of a memcg's memory at the time reclaim has failed
>    and we must oom kill: when the memcg oom killer is disabled so that
>    userspace can handle it, if it triggers an oom kill through the kernel
>    because it prefers an oom kill on an overcommitted system, we need to
>    dump the state of the memory at oom rather than with the stack of the
>    explicit trigger
>
>  - Supplement memcg oom notification with an additional notification event
>    on kernel oom kill: allows users to register for an event that triggers
>    when the kernel oom killer kills something (and keeps a count of these
>    events available for read)
>
>  - Add a notion of an oom delay: on overcommitted systems, userspace may
>    become unreliable or unresponsive despite our best efforts, this
>    supplements the ability to disable the oom killer for a memcg hierarchy
>    with the ability to disable it for a set period of time until the oom
>    killer intervenes and kills something (last ditch effort).
>
> I'd be happy to discuss any of these topics if you are interested.

Please send these patches at your convenience.
Johannes Weiner July 28, 2020, 6:04 p.m. UTC | #27
On Wed, Jul 15, 2020 at 11:56:11PM -0700, David Rientjes wrote:
> On Thu, 16 Jul 2020, Michal Hocko wrote:
> 
> > > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > > right before printing the oom info and killing the process is a very 
> > > invasive patch.  Any strong preference against doing it that way?  I think 
> > > moving the check as late as possible to save a process from being killed 
> > > when racing with an exiter or killed process (including perhaps current) 
> > > has a pretty clear motivation.
> > 
> > We have been through this discussion several times in the past IIRC
> > The conclusion has been that the allocator (charging path for
> > the memcg) is the one to define OOM situation. This is an inherently
> > racy situation as long as we are not synchronizing oom with the world,
> > which I believe we agree, we do not want to do. There are few exceptions
> > to bail out early from the oom under certain situations and the trend
> > was to remove some of the existing ones rather than adding new because
> > they had subtle side effects and were prone to lockups.
> > 
> > As much as it might sound attractive to move mem_cgroup_margin resp.
> > last allocation attempt closer to the actual oom killing I haven't seen
> > any convincing data that would support that such a change would make a
> > big difference. select_bad_process is not a free operation as it scales
> > with the number of tasks in the oom domain but it shouldn't be super
> > expensive. The oom reporting is by far the most expensive part of the
> > operation.
> > 
> > That being said, really convincing data should be presented in order
> > to do such a change. I do not think we want to do that just in case.
> 
> It's not possible to present data because we've had such a check for years 
> in our fleet so I can't say that it has prevented X unnecessary oom kills 
> compared to doing the check prior to calling out_of_memory().  I'm hoping 
> that can be understood.

That makes sense. I would just be curious whether you can remember
what data points you looked at at the time you made the change.

Were you inspecting individual OOM kills and saw in the memory dump
after the fact that the kill would have been unnecessary?

Or were you looking at aggregate OOM kill rates in your fleet and saw
a measurable reduction?

> Since Yafang is facing the same issue, and there is no significant 
> downside to doing the mem_cgroup_margin() check prior to 
> oom_kill_process() (or checking task_will_free_mem(current)), and it's 
> acknowledged that it *can* prevent unnecessary oom killing, which is a 
> very good thing, I'd like to understand why such resistance to it.

As someone who has been fairly vocal against this in the past, I have
to admit I don't feel too strongly about it anymore.

My argument in the past has been that if you're going for a
probabilistic reduction of OOM kills, injecting sleeps in between
reclaim and the last allocation attempt could also increase the chance
of coincidental parallel memory frees. And that just always seemed
somewhat arbitrary to me. I was also worried we'd be opening a can of
worms by allowing this type of tweaking of the OOM killer, when OOM
kills are supposed to be a once-in-a-blue-moon deadlock breaker,
rather than a common resource enforcement strategy.

However, I also have to admit that while artificial sleeps would
indeed be pretty arbitrary and likely controversial, the time it takes
to select a victim is unavoidable. It's something we need to do in any
case. If checking the margin one last time after that helps you bring
down the kill rate somewhat, I'd say go for it. It's a natural line,
and I would agree that the change isn't very invasive.

> Killing a user process is a serious matter.  I would fully agree if the 
> margin is only one page: it's still better to kill something off.  But 
> when a process has uncharged memory by means induced by a process waiting 
> on oom notification, such as a userspace kill or dropping of caches from 
> your malloc implementation, that uncharge can be quite substantial and oom 
> killing is then unnecessary.
> 
> I can refresh the patch and send it formally.

No objection from me.
diff mbox series

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1962232..15e0e18 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1560,15 +1560,21 @@  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	if (mutex_lock_killable(&oom_lock))
 		return true;
+
+	if (mem_cgroup_margin(memcg) >= (1 << order))
+		goto unlock;
+
 	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
 	ret = should_force_charge() || out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }