
[RFC,02/26] mm: compaction: avoid GFP_NOFS deadlocks

Message ID 20230418191313.268131-3-hannes@cmpxchg.org (mailing list archive)
State New
Series mm: reliable huge page allocator

Commit Message

Johannes Weiner April 18, 2023, 7:12 p.m. UTC
During stress testing, two deadlock scenarios were observed:

1. One GFP_NOFS allocation was sleeping on too_many_isolated(), and
   all CPUs were busy with compactors that appeared to be spinning on
   buffer locks.

   Give GFP_NOFS compactors additional isolation headroom, the same
   way we do during reclaim, to eliminate this deadlock scenario.

2. In a more pernicious scenario, the GFP_NOFS allocation was
   busy-spinning in compaction, but seemingly never making
   progress. Upon closer inspection, memory was dominated by file
   pages, which the fs compactor isn't allowed to touch. The remaining
   anon pages didn't have the contiguity to satisfy the request.

   Allow GFP_NOFS allocations to bypass watermarks when compaction
   failed at the highest priority.

While these deadlocks were encountered only in tests with the
subsequent patches (which put a lot more demand on compaction), in
theory these problems already exist in the code today. Fix them now.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/compaction.c | 15 +++++++++++++--
 mm/page_alloc.c | 10 +++++++++-
 2 files changed, 22 insertions(+), 3 deletions(-)
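
For illustration, here is a minimal user-space model of the scenario 1
change (a toy sketch, not the kernel code): with __GFP_FS set, the
limits are cut to 1/8th, so regular compactors throttle early and the
remaining isolation headroom is left to GFP_NOFS callers, which may
themselves hold fs locks that other compactors are waiting on:

	#include <stdbool.h>

	/* Toy model of too_many_isolated() after this patch. */
	static bool should_throttle(unsigned long active, unsigned long inactive,
				    unsigned long isolated, bool gfp_fs)
	{
		if (gfp_fs) {		/* regular compactor: 8x lower limit */
			inactive >>= 3;
			active >>= 3;
		}
		return isolated > (inactive + active) / 2;
	}

With 800 active and 800 inactive pages, a GFP_KERNEL compactor
throttles once more than 100 pages are isolated, while a GFP_NOFS
compactor may isolate up to 800.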

Comments

Mel Gorman April 21, 2023, 12:27 p.m. UTC | #1
On Tue, Apr 18, 2023 at 03:12:49PM -0400, Johannes Weiner wrote:
> During stress testing, two deadlock scenarios were observed:
> 
> 1. One GFP_NOFS allocation was sleeping on too_many_isolated(), and
>    all CPUs were busy with compactors that appeared to be spinning on
>    buffer locks.
> 
>    Give GFP_NOFS compactors additional isolation headroom, the same
>    way we do during reclaim, to eliminate this deadlock scenario.
> 
> 2. In a more pernicious scenario, the GFP_NOFS allocation was
>    busy-spinning in compaction, but seemingly never making
>    progress. Upon closer inspection, memory was dominated by file
>    pages, which the fs compactor isn't allowed to touch. The remaining
>    anon pages didn't have the contiguity to satisfy the request.
> 
>    Allow GFP_NOFS allocations to bypass watermarks when compaction
>    failed at the highest priority.
> 
> While these deadlocks were encountered only in tests with the
> subsequent patches (which put a lot more demand on compaction), in
> theory these problems already exist in the code today. Fix them now.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Definitely needs to be split out.


> ---
>  mm/compaction.c | 15 +++++++++++++--
>  mm/page_alloc.c | 10 +++++++++-
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8238e83385a7..84db84e8fd3a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -745,8 +745,9 @@ isolate_freepages_range(struct compact_control *cc,
>  }
>  
>  /* Similar to reclaim, but different enough that they don't share logic */
> -static bool too_many_isolated(pg_data_t *pgdat)
> +static bool too_many_isolated(struct compact_control *cc)
>  {
> +	pg_data_t *pgdat = cc->zone->zone_pgdat;
>  	bool too_many;
>  
>  	unsigned long active, inactive, isolated;
> @@ -758,6 +759,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  	isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
>  			node_page_state(pgdat, NR_ISOLATED_ANON);
>  
> +	/*
> +	 * GFP_NOFS callers are allowed to isolate more pages, so they
> +	 * won't get blocked by normal direct-reclaimers, forming a
> +	 * circular deadlock. GFP_NOIO won't get here.
> +	 */
> +	if (cc->gfp_mask & __GFP_FS) {
> +		inactive >>= 3;
> +		active >>= 3;
> +	}
> +

This comment needs to explain why GFP_NOFS gets special treatment: that
a GFP_NOFS context may not be able to migrate pages, and why.

As a follow-up, if GFP_NOFS cannot deal with the majority of the
migration contexts, then it should bail out of compaction entirely. The
changelog doesn't say why, but maybe SYNC_LIGHT is the issue?

Johannes Weiner April 21, 2023, 2:17 p.m. UTC | #2
On Fri, Apr 21, 2023 at 01:27:43PM +0100, Mel Gorman wrote:
> On Tue, Apr 18, 2023 at 03:12:49PM -0400, Johannes Weiner wrote:
> > During stress testing, two deadlock scenarios were observed:
> > 
> > 1. One GFP_NOFS allocation was sleeping on too_many_isolated(), and
> >    all CPUs were busy with compactors that appeared to be spinning on
> >    buffer locks.
> > 
> >    Give GFP_NOFS compactors additional isolation headroom, the same
> >    way we do during reclaim, to eliminate this deadlock scenario.
> > 
> > 2. In a more pernicious scenario, the GFP_NOFS allocation was
> >    busy-spinning in compaction, but seemingly never making
> >    progress. Upon closer inspection, memory was dominated by file
> >    pages, which the fs compactor isn't allowed to touch. The remaining
> >    anon pages didn't have the contiguity to satisfy the request.
> > 
> >    Allow GFP_NOFS allocations to bypass watermarks when compaction
> >    failed at the highest priority.
> > 
> > While these deadlocks were encountered only in tests with the
> > subsequent patches (which put a lot more demand on compaction), in
> > theory these problems already exist in the code today. Fix them now.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Definitely needs to be split out.

Will do.

> >  mm/compaction.c | 15 +++++++++++++--
> >  mm/page_alloc.c | 10 +++++++++-
> >  2 files changed, 22 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 8238e83385a7..84db84e8fd3a 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -745,8 +745,9 @@ isolate_freepages_range(struct compact_control *cc,
> >  }
> >  
> >  /* Similar to reclaim, but different enough that they don't share logic */
> > -static bool too_many_isolated(pg_data_t *pgdat)
> > +static bool too_many_isolated(struct compact_control *cc)
> >  {
> > +	pg_data_t *pgdat = cc->zone->zone_pgdat;
> >  	bool too_many;
> >  
> >  	unsigned long active, inactive, isolated;
> > @@ -758,6 +759,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >  	isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
> >  			node_page_state(pgdat, NR_ISOLATED_ANON);
> >  
> > +	/*
> > +	 * GFP_NOFS callers are allowed to isolate more pages, so they
> > +	 * won't get blocked by normal direct-reclaimers, forming a
> > +	 * circular deadlock. GFP_NOIO won't get here.
> > +	 */
> > +	if (cc->gfp_mask & __GFP_FS) {
> > +		inactive >>= 3;
> > +		active >>= 3;
> > +	}
> > +
> 
> This comment needs to explain why GFP_NOFS gets special treatment: that
> a GFP_NOFS context may not be able to migrate pages, and why.

Fair point, I'll expand on that.

> As a follow-up, if GFP_NOFS cannot deal with the majority of the
> migration contexts, then it should bail out of compaction entirely. The
> changelog doesn't say why, but maybe SYNC_LIGHT is the issue?

It's this condition in isolate_migratepages_block():

		/*
		 * Only allow to migrate anonymous pages in GFP_NOFS context
		 * because those do not depend on fs locks.
		 */
		if (!(cc->gfp_mask & __GFP_FS) && mapping)
			goto isolate_fail_put;

In terms of bailing even earlier: We do have per-zone file and anon
counts that could be consulted. However, the real problem is
interleaving of anon and file. Even if only 10% of the zone is anon,
it could still be worth trying to compact if they're relatively
contiguous. OTOH 50% anon could be uncompactable if every block also
contains at least one file page. We don't know until we actually scan. I'm
hesitant to give allocations premature access to the last reserves.

What might work is for NOFS contexts to test if anon is low up front
and shortcut directly to the highest priority (SYNC_FULL). One
good-faith scan attempt at least before touching the reserves.
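
A hypothetical sketch of that shortcut (the helper name and its
wmark-based threshold are invented here for illustration; nothing like
this is in the patch):

	static enum compact_priority nofs_compact_priority(struct zone *zone,
							   gfp_t gfp_mask,
							   enum compact_priority prio)
	{
		unsigned long anon;

		if (gfp_mask & __GFP_FS)
			return prio;

		/*
		 * NOFS can only migrate anon pages. If the zone has barely
		 * any anon, skip the cheaper passes and make one good-faith
		 * SYNC_FULL attempt before falling back to the reserves.
		 */
		anon = zone_page_state(zone, NR_ZONE_ACTIVE_ANON) +
		       zone_page_state(zone, NR_ZONE_INACTIVE_ANON);
		if (anon < low_wmark_pages(zone))
			return MIN_COMPACT_PRIORITY;	/* COMPACT_PRIO_SYNC_FULL */

		return prio;
	}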

Patch

diff --git a/mm/compaction.c b/mm/compaction.c
index 8238e83385a7..84db84e8fd3a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -745,8 +745,9 @@ isolate_freepages_range(struct compact_control *cc,
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
-static bool too_many_isolated(pg_data_t *pgdat)
+static bool too_many_isolated(struct compact_control *cc)
 {
+	pg_data_t *pgdat = cc->zone->zone_pgdat;
 	bool too_many;
 
 	unsigned long active, inactive, isolated;
@@ -758,6 +759,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
 			node_page_state(pgdat, NR_ISOLATED_ANON);
 
+	/*
+	 * GFP_NOFS callers are allowed to isolate more pages, so they
+	 * won't get blocked by normal direct-reclaimers, forming a
+	 * circular deadlock. GFP_NOIO won't get here.
+	 */
+	if (cc->gfp_mask & __GFP_FS) {
+		inactive >>= 3;
+		active >>= 3;
+	}
+
 	too_many = isolated > (inactive + active) / 2;
 	if (!too_many)
 		wake_throttle_isolated(pgdat);
@@ -806,7 +817,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	 * list by either parallel reclaimers or compaction. If there are,
 	 * delay for some time until fewer pages are isolated
 	 */
-	while (unlikely(too_many_isolated(pgdat))) {
+	while (unlikely(too_many_isolated(cc))) {
 		/* stop isolation if there are still pages not migrated */
 		if (cc->nr_migratepages)
 			return -EAGAIN;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bb3484563ed..ac03571e0532 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4508,8 +4508,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		prep_new_page(page, order, gfp_mask, alloc_flags);
 
 	/* Try get a page from the freelist if available */
-	if (!page)
+	if (!page) {
+		/*
+		 * It's possible that the only migration sources are
+		 * file pages, and the GFP_NOFS stack is holding up
+		 * other compactors. Use reserves to avoid deadlock.
+		 */
+		if (prio == MIN_COMPACT_PRIORITY && !(gfp_mask & __GFP_FS))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
 		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	}
 
 	if (page) {
 		struct zone *zone = page_zone(page);
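
To restate the page_alloc.c gate above as a toy predicate (not kernel
code): the reserves are tapped only once compaction has already failed
at the highest priority, and only for contexts that are not allowed to
migrate file pages:

	static bool nofs_may_bypass_watermarks(enum compact_priority prio,
					       gfp_t gfp_mask)
	{
		return prio == MIN_COMPACT_PRIORITY && !(gfp_mask & __GFP_FS);
	}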