[5/6] XFS: remove congestion_wait() loop from kmem_alloc()

Message ID	163157838439.13293.5032214643474179966.stgit@noble.brown (mailing list archive)
State	Superseded, archived
Headers	show Return-Path: <linux-xfs-owner@kernel.org> From: NeilBrown <neilb@suse.de> To: Andrew Morton <akpm@linux-foundation.org>, Theodore Ts'o <tytso@mit.edu>, Andreas Dilger <adilger.kernel@dilger.ca>, "Darrick J. Wong" <djwong@kernel.org>, Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.com> Date: Tue, 14 Sep 2021 10:13:04 +1000 Subject: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() Cc: linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Message-ID: <163157838439.13293.5032214643474179966.stgit@noble.brown> In-Reply-To: <163157808321.13293.486682642188075090.stgit@noble.brown> References: <163157808321.13293.486682642188075090.stgit@noble.brown> User-Agent: StGit/0.23 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Precedence: bulk
Series	congestion_wait() and GFP_NOFAIL \| expand [0/6] congestion_wait() and GFP_NOFAIL [1/6] MM: improve documentation for __GFP_NOFAIL [2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective. [3/6] EXT4: Remove ENOMEM/congestion_wait() loops. [4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify [5/6] XFS: remove congestion_wait() loop from kmem_alloc() [6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()

Message ID

163157838439.13293.5032214643474179966.stgit@noble.brown (mailing list archive)

State

Superseded, archived

Headers

From: NeilBrown <neilb@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>,
        Theodore Ts'o <tytso@mit.edu>,
        Andreas Dilger <adilger.kernel@dilger.ca>,
        "Darrick J. Wong" <djwong@kernel.org>,
        Matthew Wilcox <willy@infradead.org>,
        Mel Gorman <mgorman@suse.com>
Date: Tue, 14 Sep 2021 10:13:04 +1000
Subject: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
Cc: linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
        linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
        linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-ID: <163157838439.13293.5032214643474179966.stgit@noble.brown>
In-Reply-To: <163157808321.13293.486682642188075090.stgit@noble.brown>
References: <163157808321.13293.486682642188075090.stgit@noble.brown>
User-Agent: StGit/0.23
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Precedence: bulk

Series

congestion_wait() and GFP_NOFAIL | expand

Commit Message

NeilBrown Sept. 14, 2021, 12:13 a.m. UTC

Documentation commment in gfp.h discourages indefinite retry loops on
ENOMEM and says of __GFP_NOFAIL that it

    is definitely preferable to use the flag rather than opencode
    endless loop around allocator.

So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
not given.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/xfs/kmem.c |   16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

Comments

Dave Chinner Sept. 14, 2021, 1:31 a.m. UTC | #1

On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> Documentation commment in gfp.h discourages indefinite retry loops on
> ENOMEM and says of __GFP_NOFAIL that it
> 
>     is definitely preferable to use the flag rather than opencode
>     endless loop around allocator.
> 
> So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> not given.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/xfs/kmem.c |   16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 6f49bf39183c..f545f3633f88 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
>  {
>  	int	retries = 0;
>  	gfp_t	lflags = kmem_flags_convert(flags);
> -	void	*ptr;
>  
>  	trace_kmem_alloc(size, flags, _RET_IP_);
>  
> -	do {
> -		ptr = kmalloc(size, lflags);
> -		if (ptr || (flags & KM_MAYFAIL))
> -			return ptr;
> -		if (!(++retries % 100))
> -			xfs_err(NULL,
> -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> -				current->comm, current->pid,
> -				(unsigned int)size, __func__, lflags);
> -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> -	} while (1);
> +	if (!(flags & KM_MAYFAIL))
> +		lflags |= __GFP_NOFAIL;
> +
> +	return kmalloc(size, lflags);
>  }

Which means we no longer get warnings about memory allocation
failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
in this loop. Hence we'll now get silent deadlocks through this code
instead of getting warnings that memory allocation is failing
repeatedly.

I also wonder about changing the backoff behaviour here (it's a 20ms
wait right now because there are not early wakeups) will affect the
behaviour, as __GFP_NOFAIL won't wait for that extra time between
allocation attempts....

And, of course, how did you test this? Sometimes we see
unpredicted behaviours as a result of "simple" changes like this
under low memory conditions...

Cheers,

Dave.

NeilBrown Sept. 14, 2021, 3:27 a.m. UTC | #2

On Tue, 14 Sep 2021, Dave Chinner wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > Documentation commment in gfp.h discourages indefinite retry loops on
> > ENOMEM and says of __GFP_NOFAIL that it
> > 
> >     is definitely preferable to use the flag rather than opencode
> >     endless loop around allocator.
> > 
> > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> > not given.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/xfs/kmem.c |   16 ++++------------
> >  1 file changed, 4 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 6f49bf39183c..f545f3633f88 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  {
> >  	int	retries = 0;
> >  	gfp_t	lflags = kmem_flags_convert(flags);
> > -	void	*ptr;
> >  
> >  	trace_kmem_alloc(size, flags, _RET_IP_);
> >  
> > -	do {
> > -		ptr = kmalloc(size, lflags);
> > -		if (ptr || (flags & KM_MAYFAIL))
> > -			return ptr;
> > -		if (!(++retries % 100))
> > -			xfs_err(NULL,
> > -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> > -				current->comm, current->pid,
> > -				(unsigned int)size, __func__, lflags);
> > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > -	} while (1);
> > +	if (!(flags & KM_MAYFAIL))
> > +		lflags |= __GFP_NOFAIL;
> > +
> > +	return kmalloc(size, lflags);
> >  }
> 
> Which means we no longer get warnings about memory allocation
> failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
> in this loop. Hence we'll now get silent deadlocks through this code
> instead of getting warnings that memory allocation is failing
> repeatedly.

Yes, that is a problem.  Could we just clear __GFP_NOWARN when setting
__GFP_NOFAIL?
Or is the 1-in-100 important? I think default warning is 1 every 10
seconds.

> 
> I also wonder about changing the backoff behaviour here (it's a 20ms
> wait right now because there are not early wakeups) will affect the
> behaviour, as __GFP_NOFAIL won't wait for that extra time between
> allocation attempts....

The internal backoff is 100ms if there is much pending writeout, and
there are 16 internal retries.  If there is not much pending writeout, I
think it just loops with cond_resched().
So adding 20ms can only be at all interesting when the only way to
reclaim memory is something other than writeout.  I don't know how to
think about that.

> 
> And, of course, how did you test this? Sometimes we see
> unpredicted behaviours as a result of "simple" changes like this
> under low memory conditions...

I suspect this is close to untestable.  While I accept that there might
be a scenario where the change might cause some macro effect, it would
most likely be some interplay with some other subsystem struggling with
memory.  Testing XFS by itself would be unlikely to find it.

Thanks,
NeilBrown


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
>

Dave Chinner Sept. 14, 2021, 6:05 a.m. UTC | #3

On Tue, Sep 14, 2021 at 01:27:31PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Documentation commment in gfp.h discourages indefinite retry loops on
> > > ENOMEM and says of __GFP_NOFAIL that it
> > > 
> > >     is definitely preferable to use the flag rather than opencode
> > >     endless loop around allocator.
> > > 
> > > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> > > not given.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/xfs/kmem.c |   16 ++++------------
> > >  1 file changed, 4 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index 6f49bf39183c..f545f3633f88 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> > >  {
> > >  	int	retries = 0;
> > >  	gfp_t	lflags = kmem_flags_convert(flags);
> > > -	void	*ptr;
> > >  
> > >  	trace_kmem_alloc(size, flags, _RET_IP_);
> > >  
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & KM_MAYFAIL))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> > > -				current->comm, current->pid,
> > > -				(unsigned int)size, __func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & KM_MAYFAIL))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Which means we no longer get warnings about memory allocation
> > failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
> > in this loop. Hence we'll now get silent deadlocks through this code
> > instead of getting warnings that memory allocation is failing
> > repeatedly.
> 
> Yes, that is a problem.  Could we just clear __GFP_NOWARN when setting
> __GFP_NOFAIL?

Probably.

> Or is the 1-in-100 important? I think default warning is 1 every 10
> seconds.

1-in-100 is an arbitrary number to prevent spamming of logs unless
there is a real likelihood of there being a memory allocation
deadlock. We've typically only ever seen this when trying to do
high-order allocations (e.g. 64kB for xattr buffers) and failing
repeatedly in extreme memory pressure events. It's a canary that we
leave in the logs so that when a user reports problems we know that
they've been running under extended extreme low memory conditions
and can adjust the triage process accordingly.

So, we could remove __GFP_NOWARN, as long as the core allocator code
has sufficient rate limiting that it won't spam the logs due to
extended failure looping...

> > I also wonder about changing the backoff behaviour here (it's a 20ms
> > wait right now because there are not early wakeups) will affect the
> > behaviour, as __GFP_NOFAIL won't wait for that extra time between
> > allocation attempts....
> 
> The internal backoff is 100ms if there is much pending writeout, and
> there are 16 internal retries.  If there is not much pending writeout, I
> think it just loops with cond_resched().
> So adding 20ms can only be at all interesting when the only way to
> reclaim memory is something other than writeout.  I don't know how to
> think about that.

Any cache that uses a shrinker to reclaim (e.g. dentry, inodes, fs
metadata, etc due to recursive directory traversals) can cause
reclaim looping and priority escalation without there being any page
cache writeback or reclaim possible. Especially when you have
GFP_NOFS allocation context and all your memory is in VFS level
caches. At that point, direct reclaim cannot (and will not) make
forwards progress, so we still have to wait for some other
GFP_KERNEL context reclaim (e.g. kswapd) to make progress reclaiming
memory while we wait.

Fundamentally, the memory reclaim backoff code doesn't play well
with shrinkers. Patches from an old patchset which pushed lack of
shrinker progress back up into the vmscan level backoff algorithms
was something I was experimenting with a few years ago. e.g.

https://lore.kernel.org/linux-xfs/20191031234618.15403-16-david@fromorbit.com/
https://lore.kernel.org/linux-xfs/20191031234618.15403-17-david@fromorbit.com/

We didn't end up going this way to solve the XFS inode reclaim
problems - I ended up solving that entirely by pinning XFS buffer
cache memory and modifying the XFS inode shrinker - but it was this
patchset that first exposed the fact that congestion_wait() was no
longer functioning as intended. See the last few paragraphs of the
(long) cover letter for v1 of that patchset here:

https://lore.kernel.org/linux-xfs/20190801021752.4986-1-david@fromorbit.com/

So, yeah, I know full well that congestion_wait() is mostly just
an unconditional timeout these days...

> > And, of course, how did you test this? Sometimes we see
> > unpredicted behaviours as a result of "simple" changes like this
> > under low memory conditions...
> 
> I suspect this is close to untestable.  While I accept that there might
> be a scenario where the change might cause some macro effect, it would
> most likely be some interplay with some other subsystem struggling with
> memory.  Testing XFS by itself would be unlikely to find it.

Filesystem traversal workloads (e.g. chown -R) are the ones that
hammer memory allocation from GFP_NOFS context which creates memory
pressure that cannot be balanced by direct reclaim as direct reclaim
cannot reclaim filesystem caches in this situation. This is where I
would expect extra backoff on failing GFP_NOFS allocations to have
some effect...

Cheers,

Dave.

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 6f49bf39183c..f545f3633f88 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -13,19 +13,11 @@  kmem_alloc(size_t size, xfs_km_flags_t flags)
 {
 	int	retries = 0;
 	gfp_t	lflags = kmem_flags_convert(flags);
-	void	*ptr;
 
 	trace_kmem_alloc(size, flags, _RET_IP_);
 
-	do {
-		ptr = kmalloc(size, lflags);
-		if (ptr || (flags & KM_MAYFAIL))
-			return ptr;
-		if (!(++retries % 100))
-			xfs_err(NULL,
-	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
-				current->comm, current->pid,
-				(unsigned int)size, __func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-	} while (1);
+	if (!(flags & KM_MAYFAIL))
+		lflags |= __GFP_NOFAIL;
+
+	return kmalloc(size, lflags);
 }