
[RFC] mm/oom: detect and kill task which has allocation forbidden by cpuset limit

Message ID 1630399085-70431-1-git-send-email-feng.tang@intel.com (mailing list archive)
State New
Series [RFC] mm/oom: detect and kill task which has allocation forbidden by cpuset limit

Commit Message

Feng Tang Aug. 31, 2021, 8:38 a.m. UTC
There was a report that starting an Ubuntu container in docker, while
using cpuset to bind it to movable nodes (nodes that only have a
movable zone, like a node for hotplug or a PMEM node in normal usage),
will fail due to memory allocation failure; the OOM killer is then
invoked and many other innocent processes get killed. It can be
reproduced with the command:
$docker run -it --rm  --cpuset-mems 4 ubuntu:latest bash -c
"grep Mems_allowed /proc/self/status" (node 4 is a movable node)

The reason is that in this case the target cpuset nodes only have a
movable zone, while the creation of an OS inside docker sometimes needs
to allocate memory from non-movable zones (DMA/DMA32/Normal), e.g. with
GFP_HIGHUSER, and the cpuset limit forbids such allocations. The
out-of-memory killer is then invoked even though both the normal nodes
and the movable nodes still have plenty of free memory.
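As a rough illustration only (this helper is not part of the patch and
its name is made up), the zone capping comes from gfp_zone(): a request
without __GFP_MOVABLE can never be placed in ZONE_MOVABLE, so a node
whose only populated zone is Movable has no eligible zone for it:

#include <linux/gfp.h>

/*
 * Illustrative sketch, not from the patch: GFP_HIGHUSER is
 * GFP_USER | __GFP_HIGHMEM and does not carry __GFP_MOVABLE, so
 * gfp_zone(GFP_HIGHUSER) evaluates to ZONE_NORMAL (ZONE_HIGHMEM on
 * 32-bit), never ZONE_MOVABLE, and a movable-only node cannot serve it.
 */
static bool can_use_movable_zone(gfp_t gfp_mask)
{
	return gfp_zone(gfp_mask) == ZONE_MOVABLE;
}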

We've posted patches to LKML trying to make this usage work by
loosening the check, but that was not accepted since the cpuset binding
should be respected and not bypassed [1].

Still, there is another problem: when the usage fails because the
allocation is impossible under the cpuset limit, the allocating task
should be killed first, before any other innocent processes get killed.

Add detection for cases like this, and kill the allocating task only.

[1] https://lore.kernel.org/lkml/1604470210-124827-1-git-send-email-feng.tang@intel.com/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 13 ++++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

Comments

Michal Hocko Aug. 31, 2021, 3:57 p.m. UTC | #1
On Tue 31-08-21 16:38:05, Feng Tang wrote:
> There was a report that starting an Ubuntu container in docker, while
> using cpuset to bind it to movable nodes (nodes that only have a
> movable zone, like a node for hotplug or a PMEM node in normal usage),
> will fail due to memory allocation failure; the OOM killer is then
> invoked and many other innocent processes get killed. It can be
> reproduced with the command:
> $docker run -it --rm  --cpuset-mems 4 ubuntu:latest bash -c
> "grep Mems_allowed /proc/self/status" (node 4 is a movable node)

Is there any valid use case for allowing cpusets to be configured with
only movable nodes? Wouldn't it be better to simply disallow such a
setup? I do understand that we usually allow people to shoot themselves
in the foot, but this one has some wider consequences.

> The reason is that in this case the target cpuset nodes only have a
> movable zone, while the creation of an OS inside docker sometimes needs
> to allocate memory from non-movable zones (DMA/DMA32/Normal), e.g. with
> GFP_HIGHUSER, and the cpuset limit forbids such allocations. The
> out-of-memory killer is then invoked even though both the normal nodes
> and the movable nodes still have plenty of free memory.
> 
> We've posted patches to LKML trying to make this usage work by
> loosening the check, but that was not accepted since the cpuset binding
> should be respected and not bypassed [1].
> 
> Still, there is another problem: when the usage fails because the
> allocation is impossible under the cpuset limit, the allocating task
> should be killed first, before any other innocent processes get killed.

I do not like this solution TBH. We know at the page allocator level
that it is impossible to satisfy the allocation, so dealing with it at
the OOM killer level is just bad layering and a lot of wasted cycles to
reach that point. Why can't we simply fail the allocation if cpuset
filtering leads to an empty zone intersection?
David Rientjes Sept. 1, 2021, 1:06 a.m. UTC | #2
On Tue, 31 Aug 2021, Michal Hocko wrote:

> I do not like this solution TBH. We know at the page allocator level
> that it is impossible to satisfy the allocation, so dealing with it at
> the OOM killer level is just bad layering and a lot of wasted cycles to
> reach that point. Why can't we simply fail the allocation if cpuset
> filtering leads to an empty zone intersection?

Cpusets will guarantee our effective nodemask will include at least one 
node in N_MEMORY (cpuset_mems_allowed()) so we'll always have at least one 
zone in our zonelist.

Issue in this case appears to be that the zone will never satisfy 
non-movable allocations.  I think this would be very similar to a GFP_DMA 
allocation when bound to a node without lowmem, in which case we get a 
page allocation failure.  We don't kill current like this patch.

So I'd agree in this case that it would be better to simply fail the 
allocation.

Feng, would you move this check to __alloc_pages_may_oom() like the other 
special cases and simply fail rather than call into the oom killer?
Feng Tang Sept. 1, 2021, 2:44 a.m. UTC | #3
Hi David and Michal,

On Tue, Aug 31, 2021 at 06:06:17PM -0700, David Rientjes wrote:
> On Tue, 31 Aug 2021, Michal Hocko wrote:
> 
> > I do not like this solution TBH. We know at the page allocator level
> > that it is impossible to satisfy the allocation, so dealing with it at
> > the OOM killer level is just bad layering and a lot of wasted cycles to
> > reach that point. Why can't we simply fail the allocation if cpuset
> > filtering leads to an empty zone intersection?
> 
> Cpusets will guarantee our effective nodemask will include at least one 
> node in N_MEMORY (cpuset_mems_allowed()) so we'll always have at least one 
> zone in our zonelist.
> 
> Issue in this case appears to be that the zone will never satisfy 
> non-movable allocations.  I think this would be very similar to a GFP_DMA 
> allocation when bound to a node without lowmem, in which case we get a 
> page allocation failure.  We don't kill current like this patch.
 
Thanks for sharing that case, the GFP_DMA case is indeed quite similar.
And in our usage, the allocating task is finally killed only after many
OS routine/GUI tasks get killed.

> So I'd agree in this case that it would be better to simply fail the 
> allocation.

I agree with your and Michal's comments; putting this in the OOM code
is a little late and wastes CPU cycles.

> Feng, would you move this check to __alloc_pages_may_oom() like the other 
> special cases and simply fail rather than call into the oom killer?

Will explore more in this direction, thanks!

- Feng
Feng Tang Sept. 1, 2021, 1:42 p.m. UTC | #4
On Wed, Sep 01, 2021 at 10:44:02AM +0800, Tang, Feng wrote:
[SNIP]
> > So I'd agree in this case that it would be better to simply fail the 
> > allocation.
> 
> I agree with your and Michal's comments; putting this in the OOM code
> is a little late and wastes CPU cycles.
> 
> > Feng, would you move this check to __alloc_pages_may_oom() like the other 
> > special cases and simply fail rather than call into the oom killer?
> 
> Will explore more in this direction, thanks!
 
I tried the patch below, which solves the blind-killing issue: the
docker processes see page allocation errors and eventually quit
running.

Thanks,
Feng

---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9cb36bb..d1ae77be45a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4271,10 +4271,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	struct page *page;
+	struct page *page = NULL;
+	struct zoneref *z;
 
 	*did_some_progress = 0;
 
+	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
+		z = first_zones_zonelist(ac->zonelist,
+			gfp_zone(gfp_mask), &cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto out;
+	}
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
Michal Hocko Sept. 1, 2021, 2:05 p.m. UTC | #5
On Wed 01-09-21 21:42:00, Feng Tang wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eeb3a9cb36bb..d1ae77be45a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4271,10 +4271,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	struct page *page;
> +	struct page *page = NULL;
> +	struct zoneref *z;
>  
>  	*did_some_progress = 0;
>  
> +	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
> +		z = first_zones_zonelist(ac->zonelist,
> +			gfp_zone(gfp_mask), &cpuset_current_mems_allowed);
> +		if (!z->zone)
> +			goto out;
> +	}
> +

This looks better than the previous attempt. It would still be better
to solve this at the page allocator layer. The slowpath is already
doing this for the nodemask. E.g.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9cb36bb..a3193134540d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4929,6 +4929,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain any suitable
+	 * zone to satisfy the request - e.g. kernel allocations from MOVABLE nodes only
+	 */
+	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx,
+				&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
 
If this is seen as additional overhead for an insane configuration,
then we can add insane_cpusets_enabled(), which would be a static
branch enabled when somebody actually tries to configure movable-only
cpusets or potentially other dubious usage.
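
For reference, a minimal sketch of what such a static branch could look
like (insane_cpusets_enabled() does not exist in the kernel today; the
names and placement below are assumptions, not an existing API):

#include <linux/jump_label.h>

/* Sketch only: hypothetical key, not an existing kernel symbol. */
DEFINE_STATIC_KEY_FALSE(insane_cpusets);

static inline bool insane_cpusets_enabled(void)
{
	return static_branch_unlikely(&insane_cpusets);
}

/*
 * Hypothetically called from the cpuset code when a cpuset is set up
 * with movable-only nodes (or some other dubious configuration), so
 * sane setups never pay for the extra slowpath check.
 */
static void note_insane_cpuset(void)
{
	static_branch_enable(&insane_cpusets);
}

The check added to __alloc_pages_slowpath() above would then test
insane_cpusets_enabled() instead of cpusets_enabled().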
Feng Tang Sept. 2, 2021, 7:34 a.m. UTC | #6
On Wed, Sep 01, 2021 at 04:05:40PM +0200, Michal Hocko wrote:
[SNIP]
> 
> This looks better than the previous attempt. It would still be better
> to solve this at the page allocator layer. The slowpath is already
> doing this for the nodemask. E.g.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eeb3a9cb36bb..a3193134540d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4929,6 +4929,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> +	/*
> +	 * Check for insane configurations where the cpuset doesn't contain any suitable
> +	 * zone to satisfy the request - e.g. kernel allocations from MOVABLE nodes only
> +	 */
> +	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
> +		struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx,
> +				&cpuset_current_mems_allowed);
> +		if (!z->zone)
> +			goto nopage;
> +	}
> +
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);

Thanks for the suggestion! It does bail out early, skipping kswapd,
direct reclaim and compaction.

I also looked at prepare_alloc_pages(), which does some cpuset checking
and zone initialization, but I'd better leave it alone as it's in a
real hot path, while this check is in the slowpath anyway.

Will run some page fault benchmark cases with this patch.

Thanks,
Feng

> If this is seen as additional overhead for an insane configuration,
> then we can add insane_cpusets_enabled(), which would be a static
> branch enabled when somebody actually tries to configure movable-only
> cpusets or potentially other dubious usage.
> -- 
> Michal Hocko
> SUSE Labs

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..bf470d8cc421 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -18,6 +18,7 @@  struct task_struct;
 enum oom_constraint {
 	CONSTRAINT_NONE,
 	CONSTRAINT_CPUSET,
+	CONSTRAINT_CPUSET_NONE,	/* no available zone from cpuset's mem nodes */
 	CONSTRAINT_MEMORY_POLICY,
 	CONSTRAINT_MEMCG,
 };
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 431d38c3bba8..021ec8954279 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -247,6 +247,7 @@  long oom_badness(struct task_struct *p, unsigned long totalpages)
 static const char * const oom_constraint_text[] = {
 	[CONSTRAINT_NONE] = "CONSTRAINT_NONE",
 	[CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET",
+	[CONSTRAINT_CPUSET_NONE] = "CONSTRAINT_CPUSET_NONE",
 	[CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY",
 	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
 };
@@ -275,6 +276,14 @@  static enum oom_constraint constrained_alloc(struct oom_control *oc)
 
 	if (!oc->zonelist)
 		return CONSTRAINT_NONE;
+
+	if (cpusets_enabled() && (oc->gfp_mask & __GFP_HARDWALL)) {
+		z = first_zones_zonelist(oc->zonelist,
+			highest_zoneidx, &cpuset_current_mems_allowed);
+		if (!z->zone)
+			return CONSTRAINT_CPUSET_NONE;
+	}
+
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
 	 * to kill current.We have to random task kill in this case.
@@ -1093,7 +1102,9 @@  bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc);
 
-	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
+	if (!is_memcg_oom(oc) &&
+	    (sysctl_oom_kill_allocating_task ||
+	       oc->constraint == CONSTRAINT_CPUSET_NONE) &&
 	    current->mm && !oom_unkillable_task(current) &&
 	    oom_cpuset_eligible(current, oc) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {