[6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()

Message ID: 163157838440.13293.12568710689057349786.stgit@noble.brown (mailing list archive)
State: New, archived
Series: congestion_wait() and GFP_NOFAIL

Commit Message

NeilBrown Sept. 14, 2021, 12:13 a.m. UTC
Documentation comment in gfp.h discourages indefinite retry loops on
ENOMEM and says of __GFP_NOFAIL that it

    is definitely preferable to use the flag rather than opencode
    endless loop around allocator.

congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible() in practice and it is not a good way
to wait for memory to become available.

So instead of waiting, allocate a single page using __GFP_NOFAIL, then
loop around and try to get any more pages that might be needed with a
bulk allocation.  This single-page allocation will wait in the most
appropriate way.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/xfs/xfs_buf.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Dave Chinner Sept. 14, 2021, 2:08 a.m. UTC | #1
On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> Documentation comment in gfp.h discourages indefinite retry loops on
> ENOMEM and says of __GFP_NOFAIL that it
> 
>     is definitely preferable to use the flag rather than opencode
>     endless loop around allocator.
> 
> congestion_wait() is indistinguishable from
> schedule_timeout_uninterruptible() in practice and it is not a good way
> to wait for memory to become available.
> 
> So instead of waiting, allocate a single page using __GFP_NOFAIL, then
> loop around and try to get any more pages that might be needed with a
> bulk allocation.  This single-page allocation will wait in the most
> appropriate way.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/xfs/xfs_buf.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 5fa6cd947dd4..1ae3768f6504 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -372,8 +372,8 @@ xfs_buf_alloc_pages(
>  
>  	/*
>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> -	 * array is not an allocation failure, so don't back off if we get at
> -	 * least one extra page.
> +	 * array is not an allocation failure, so don't fail or fall back on
> +	 * __GFP_NOFAIL if we get at least one extra page.
>  	 */
>  	for (;;) {
>  		long	last = filled;
> @@ -394,7 +394,7 @@ xfs_buf_alloc_pages(
>  		}
>  
>  		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> -		congestion_wait(BLK_RW_ASYNC, HZ / 50);
> +		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);

This smells wrong - the whole point of using the bulk page allocator
in this loop is to avoid the costly individual calls to
alloc_page().

What we are implementing here is fail-fast semantics for readahead and
fail-never semantics for everything else.  If the bulk allocator fails to get
a page from the fast path free lists, it already falls back to
__alloc_pages(gfp, 0, ...) to allocate a single page. So AFAICT
there's no need to add another call to alloc_page() because we can
just do this instead:

	if (flags & XBF_READ_AHEAD)
		gfp_mask |= __GFP_NORETRY;
	else
-		gfp_mask |= GFP_NOFS;
+		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;

Which should make the __alloc_pages() call in
alloc_pages_bulk_array() do a __GFP_NOFAIL allocation and hence
provide the necessary never-fail guarantee that is needed here.

At which point, the bulk allocation loop can be simplified because
we can only fail bulk allocation for readahead, so something like:

		if (filled == bp->b_page_count) {
			XFS_STATS_INC(bp->b_mount, xb_page_found);
			break;
		}

-		if (filled != last)
+		if (filled == last) {
-			continue;
-
-		if (flags & XBF_READ_AHEAD) {
			ASSERT(flags & XBF_READ_AHEAD);
			xfs_buf_free_pages(bp);
			return -ENOMEM;
		}

		XFS_STATS_INC(bp->b_mount, xb_page_retries);
-		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
	return 0;
}

would do the right thing and still record that we are doing
blocking allocations (via the xb_page_retries stat) in this loop.

Cheers,

Dave.
NeilBrown Sept. 14, 2021, 2:35 a.m. UTC | #2
On Tue, 14 Sep 2021, Dave Chinner wrote:
> On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > [...]
> 
> This smells wrong - the whole point of using the bulk page allocator
> in this loop is to avoid the costly individual calls to
> alloc_page().
> 
> What we are implementing here is fail-fast semantics for readahead and
> fail-never for everything else.  If the bulk allocator fails to get
> a page from the fast path free lists, it already falls back to
> __alloc_pages(gfp, 0, ...) to allocate a single page. So AFAICT
> there's no need to add another call to alloc_page() because we can
> just do this instead:
> 
> 	if (flags & XBF_READ_AHEAD)
> 		gfp_mask |= __GFP_NORETRY;
> 	else
> -		gfp_mask |= GFP_NOFS;
> +		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;
> 
> Which should make the __alloc_pages() call in
> alloc_pages_bulk_array() do a __GFP_NOFAIL allocation and hence
> provide the necessary never-fail guarantee that is needed here.

That is a nice simplification.
Mel Gorman told me
  https://lore.kernel.org/linux-nfs/20210907153116.GJ3828@suse.com/
that alloc_pages_bulk ignores GFP_NOFAIL.  I added that to the
documentation comment in an earlier patch.

I had a look at the code and cannot see how it would fail to allocate at
least one page.  Maybe Mel can help....

NeilBrown
Dave Chinner Sept. 14, 2021, 5:33 a.m. UTC | #3
On Tue, Sep 14, 2021 at 12:35:59PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > [...]
> 
> That is a nice simplification.
> Mel Gorman told me
>   https://lore.kernel.org/linux-nfs/20210907153116.GJ3828@suse.com/
> that alloc_pages_bulk ignores GFP_NOFAIL.  I added that to the
> documentation comment in an earlier patch.

Well, that's a surprise to me - I can't see where it masked out
NOFAIL, and it seems quite arbitrary to just say "different code
needs different fallbacks, so you can't have NOFAIL" despite NOFAIL
being the exact behavioural semantics that one of only three users of
bulk allocator really needs...

> I had a look at the code and cannot see how it would fail to allocate at
> least one page.  Maybe Mel can help....

Yup, clarification is definitely needed here.

Cheers,

Dave.
Mel Gorman Sept. 14, 2021, 4:45 p.m. UTC | #4
On Tue, Sep 14, 2021 at 12:35:59PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > [...]
> 
> That is a nice simplification.
> Mel Gorman told me
>   https://lore.kernel.org/linux-nfs/20210907153116.GJ3828@suse.com/
> that alloc_pages_bulk ignores GFP_NOFAIL.  I added that to the
> documentation comment in an earlier patch.
> 
> I had a look at the code and cannot see how it would fail to allocate at
> least one page.  Maybe Mel can help....
> 

If there is already at least one page in the array and the first attempt
at bulk allocation fails, it'll simply return. It's an odd corner case
that may never apply, but it's possible.  That said, I'm of the opinion that
__GFP_NOFAIL should not be expanded and that congestion_wait() should instead
be deleted and replaced with something triggered by reclaim making progress.
NeilBrown Sept. 14, 2021, 9:13 p.m. UTC | #5
On Wed, 15 Sep 2021, Mel Gorman wrote:
> On Tue, Sep 14, 2021 at 12:35:59PM +1000, NeilBrown wrote:
> > [...]
> 
> If there is already at least one page in the array and the first attempt
> at bulk allocation fails, it'll simply return. It's an odd corner case
> that may never apply, but it's possible.  That said, I'm of the opinion that
> __GFP_NOFAIL should not be expanded and that congestion_wait() should instead
> be deleted and replaced with something triggered by reclaim making progress.

Ahh.... that was (I think) fixed by
https://patchwork.kernel.org/project/linux-mm/patch/163027609524.7591.4987241695872857175@noble.neil.brown.name/
(which I cannot find on lore.kernel.org - strange)
which you acked - and which I meant to include in this series but
somehow missed.

NeilBrown
Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5fa6cd947dd4..1ae3768f6504 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -372,8 +372,8 @@  xfs_buf_alloc_pages(
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
-	 * array is not an allocation failure, so don't back off if we get at
-	 * least one extra page.
+	 * array is not an allocation failure, so don't fail or fall back on
+	 * __GFP_NOFAIL if we get at least one extra page.
 	 */
 	for (;;) {
 		long	last = filled;
@@ -394,7 +394,7 @@  xfs_buf_alloc_pages(
 		}
 
 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
-		congestion_wait(BLK_RW_ASYNC, HZ / 50);
+		bp->b_pages[filled++] = alloc_page(gfp_mask | __GFP_NOFAIL);
 	}
 	return 0;
 }