
[v3] mm, oom: prevent soft lockup on memcg oom for UP systems

Message ID alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com (mailing list archive)

Commit Message

David Rientjes March 18, 2020, 10:03 p.m. UTC
When a process is oom killed as a result of memcg limits and the victim
is waiting to exit, nothing actually yields the processor back to the
victim on UP systems with preemption disabled.  Instead, the charging
process simply loops in memcg reclaim and eventually soft locks up.

For example, on a UP system with a memcg limited to 100MB, if three
processes each charge 40MB of heap with swap disabled, one of the charging
processes can loop endlessly trying to charge memory, starving the oom
victim.

Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, 
anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB 
oom_score_adj:0
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
RIP: 0010:shrink_lruvec+0x4e9/0xa40
...
Call Trace:
 shrink_node+0x40d/0x7d0
 do_try_to_free_pages+0x13f/0x470
 try_to_free_mem_cgroup_pages+0x16d/0x230
 try_charge+0x247/0xac0
 mem_cgroup_try_charge+0x10a/0x220
 mem_cgroup_try_charge_delay+0x1e/0x40
 handle_mm_fault+0xdf2/0x15f0
 do_user_addr_fault+0x21f/0x420
 page_fault+0x2f/0x40

Make sure that once the oom killer has been called, we forcibly yield if
current is not the chosen victim, regardless of priority, to allow for
memory freeing.  The same situation can theoretically occur in the page
allocator, so do this after dropping oom_lock there as well.

We used to have a short sleep after oom killing, but commit 9bfe5ded054b
("mm, oom: remove sleep from under oom_lock") removed it because sleeping
inside the oom_lock is dangerous. This patch restores the sleep outside of
the lock.

Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Tested-by: Robert Kolchmeyer <rkolchmeyer@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/memcontrol.c | 6 ++++++
 mm/page_alloc.c | 6 ++++++
 2 files changed, 12 insertions(+)

Comments

Michal Hocko March 19, 2020, 7:09 a.m. UTC | #1
On Wed 18-03-20 15:03:52, David Rientjes wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing actually yields the processor back to the
> victim on UP systems with preemption disabled.  Instead, the charging
> process simply loops in memcg reclaim and eventually soft locks up.
> 
> For example, on a UP system with a memcg limited to 100MB, if three
> processes each charge 40MB of heap with swap disabled, one of the charging
> processes can loop endlessly trying to charge memory, starving the oom
> victim.

This only happens if there is no reclaimable memory in the hierarchy.
That is a very specific condition. I do not see any way other than a
misconfigured system with min protection preventing any reclaim.
Otherwise we have cond_resched() both in the slab shrinking code
(do_shrink_slab) and in LRU shrinking (shrink_lruvec). If I am wrong and
those are insufficient then please be explicit about the scenario.

This is very important information to have in the changelog!

[...]

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1576,6 +1576,12 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
> +	/*
> +	 * Give a killed process a good chance to exit before trying to
> +	 * charge memory again.
> +	 */
> +	if (ret)
> +		schedule_timeout_killable(1);

Why are you making this conditional? Say that there is no victim to
kill. The charge path would simply bail out, and whether there is a
scheduling point would really depend on the call chain. Isn't it simply
safer to call schedule_timeout_killable() unconditionally at this
stage?

>  	return ret;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3861,6 +3861,12 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  out:
>  	mutex_unlock(&oom_lock);
> +	/*
> +	 * Give a killed process a good chance to exit before trying to
> +	 * allocate memory again.
> +	 */
> +	if (*did_some_progress)
> +		schedule_timeout_killable(1);

This doesn't make much sense either. Please remember that the primary
reason you are adding schedule_timeout_killable() in this path is to
somehow reduce the priority inversion problem mentioned by Tetsuo,
because the page allocator path doesn't lack regular scheduling points -
compaction, reclaim, should_reclaim_retry etc. have them.

>  	return page;
>  }
>

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1576,6 +1576,12 @@  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	ret = should_force_charge() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+	/*
+	 * Give a killed process a good chance to exit before trying to
+	 * charge memory again.
+	 */
+	if (ret)
+		schedule_timeout_killable(1);
 	return ret;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3861,6 +3861,12 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 out:
 	mutex_unlock(&oom_lock);
+	/*
+	 * Give a killed process a good chance to exit before trying to
+	 * allocate memory again.
+	 */
+	if (*did_some_progress)
+		schedule_timeout_killable(1);
 	return page;
 }