
btrfs: fallback to single page allocation to avoid bulk allocation latency

Message ID 9ec01e023cc34e5729dd4a86affd5158f87c7a83.1711311627.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: fallback to single page allocation to avoid bulk allocation latency

Commit Message

Qu Wenruo March 24, 2024, 8:20 p.m. UTC
[BUG]
There is a recent report that when memory pressure is high (including
cached pages), btrfs can spend most of its time on memory allocation in
btrfs_alloc_page_array().

[CAUSE]
For btrfs_alloc_page_array() we always call alloc_pages_bulk_array(), and
even if the bulk allocation failed we still retry, but with an extra
memalloc_retry_wait().

If the bulk alloc only returned one page at a time, we would spend a lot of
time on the retry wait.

[FIX]
Instead of always retrying the same bulk allocation, fall back to single
page allocation if the initial bulk allocation attempt doesn't fill the
whole request.

Even if this means a higher chance of memory allocation failure.

Reported-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 35 ++++++++++++-----------------------
 1 file changed, 12 insertions(+), 23 deletions(-)

Comments

Filipe Manana March 25, 2024, 12:32 p.m. UTC | #1
On Sun, Mar 24, 2024 at 8:21 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> There is a recent report that when memory pressure is high (including
> cached pages), btrfs can spend most of its time on memory allocation in
> btrfs_alloc_page_array().
>
> [CAUSE]
> For btrfs_alloc_page_array() we always call alloc_pages_bulk_array(), and
> even if the bulk allocation failed we still retry, but with an extra
> memalloc_retry_wait().
>
> If the bulk alloc only returned one page at a time, we would spend a lot of
> time on the retry wait.
>
> [FIX]
> Instead of always retrying the same bulk allocation, fall back to single
> page allocation if the initial bulk allocation attempt doesn't fill the
> whole request.
>
> Even if this means a higher chance of memory allocation failure.

Well, this last sentence is a bit confusing.
I mean, the chances are the same as we had before commit 91d6ac1d62c3
("btrfs: allocate page arrays using bulk page allocator").

>
> Reported-by: Julian Taylor <julian.taylor@1und1.de>
> Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/

As Julian seems to have tested this and confirmed it solves the
regression for him, a Tested-by tag should be here.

Also given that this seems like a huge performance drop, a Fixes tag
would be appropriate here.

Just one comment about the code below.

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 35 ++++++++++++-----------------------
>  1 file changed, 12 insertions(+), 23 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7441245b1ceb..d49e7f0384ed 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -681,33 +681,22 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
>  int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
>                            gfp_t extra_gfp)
>  {
> +       const gfp_t gfp = GFP_NOFS | extra_gfp;
>         unsigned int allocated;
>
> -       for (allocated = 0; allocated < nr_pages;) {
> -               unsigned int last = allocated;
> -
> -               allocated = alloc_pages_bulk_array(GFP_NOFS | extra_gfp,
> -                                                  nr_pages, page_array);
> -
> -               if (allocated == nr_pages)
> -                       return 0;
> -
> -               /*
> -                * During this iteration, no page could be allocated, even
> -                * though alloc_pages_bulk_array() falls back to alloc_page()
> -                * if  it could not bulk-allocate. So we must be out of memory.
> -                */
> -               if (allocated == last) {
> -                       for (int i = 0; i < allocated; i++) {
> -                               __free_page(page_array[i]);
> -                               page_array[i] = NULL;
> -                       }
> -                       return -ENOMEM;
> -               }
> -
> -               memalloc_retry_wait(GFP_NOFS);
> +       allocated = alloc_pages_bulk_array(GFP_NOFS | gfp, nr_pages, page_array);

No need to add GFP_NOFS, the gfp variable already has it.
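That is, the call can simply become:

	allocated = alloc_pages_bulk_array(gfp, nr_pages, page_array);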

Otherwise it looks good, so with that adjusted:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +       for (; allocated < nr_pages; allocated++) {
> +               page_array[allocated] = alloc_page(gfp);
> +               if (unlikely(!page_array[allocated]))
> +                       goto enomem;
>         }
>         return 0;
> +enomem:
> +       for (int i = 0; i < allocated; i++) {
> +               __free_page(page_array[i]);
> +               page_array[i] = NULL;
> +       }
> +       return -ENOMEM;
>  }
>
>  /*
> --
> 2.44.0
>
>
Sweet Tea Dorminy March 25, 2024, 9:18 p.m. UTC | #2
On 3/24/24 16:20, Qu Wenruo wrote:
> [BUG]
> There is a recent report that when memory pressure is high (including
> cached pages), btrfs can spend most of its time on memory allocation in
> btrfs_alloc_page_array().
> 
> [CAUSE]
> For btrfs_alloc_page_array() we always call alloc_pages_bulk_array(), and
> even if the bulk allocation failed we still retry, but with an extra
> memalloc_retry_wait().
> 
> If the bulk alloc only returned one page at a time, we would spend a lot of
> time on the retry wait.
> 
> [FIX]
> Instead of always retrying the same bulk allocation, fall back to single
> page allocation if the initial bulk allocation attempt doesn't fill the
> whole request.


I still think the argument in 395cb57e85604 is reasonably compelling: 
that it's polite and normative among filesystems to throw in a 
retry_wait() between attempts to allocate, and I'm not sure why we're 
seeing this performance impact where others seemingly don't.

But your change does a lot more than just cutting out the retry_wait(); 
it only tries the bulk allocator once. Given the bulk allocator falls 
back to the single-page allocator itself and can allocate multiple pages 
on any given iteration, I think it's better to just cut out the 
retry_wait() if we must, in hopes we can allocate more than one when 
retrying?

Or, could we add __GFP_RETRY_MAYFAIL to the btrfs_alloc_page_array()
call in compression.c instead, so that page reclaim is started if the
initial alloc_pages_bulk_array() doesn't work out?
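Just to make the idea concrete, something like this at the
compression.c call site (the variable names here are illustrative,
not the actual ones in that file):

	/*
	 * __GFP_RETRY_MAYFAIL lets the allocator enter reclaim and retry
	 * before giving up, instead of failing the bulk fill early.
	 */
	ret = btrfs_alloc_page_array(nr_pages, pages, __GFP_RETRY_MAYFAIL);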
Qu Wenruo March 25, 2024, 9:38 p.m. UTC | #3
On 2024/3/26 07:48, Sweet Tea Dorminy wrote:
> 
> 
> On 3/24/24 16:20, Qu Wenruo wrote:
>> [BUG]
>> There is a recent report that when memory pressure is high (including
>> cached pages), btrfs can spend most of its time on memory allocation in
>> btrfs_alloc_page_array().
>>
>> [CAUSE]
>> For btrfs_alloc_page_array() we always call alloc_pages_bulk_array(), and
>> even if the bulk allocation failed we still retry, but with an extra
>> memalloc_retry_wait().
>>
>> If the bulk alloc only returned one page at a time, we would spend a lot of
>> time on the retry wait.
>>
>> [FIX]
>> Instead of always retrying the same bulk allocation, fall back to single
>> page allocation if the initial bulk allocation attempt doesn't fill the
>> whole request.
> 
> 
> I still think the argument in 395cb57e85604 is reasonably compelling: 
> that it's polite and normative among filesystems to throw in a 
> retry_wait() between attempts to allocate, and I'm not sure why we're 
> seeing this performance impact where others seemingly don't.

The reporter is hitting this during compression, while no other 
major fs (until the recent bcachefs?) supports compression.

Thus I guess we have a corner case that other fses don't?

> 
> But your change does a lot more than just cutting out the retry_wait(); 
> it only tries the bulk allocator once. Given the bulk allocator falls 
> back to the single-page allocator itself and can allocate multiple pages 
> on any given iteration, I think it's better to just cut out the 
> retry_wait() if we must, in hopes we can allocate more than one when 
> retrying?
> 
> Or, could we add __GFP_RETRY_MAYFAIL to the btrfs_alloc_page_array()
> call in compression.c instead, so that page reclaim is started if the
> initial alloc_pages_bulk_array() doesn't work out?

My question is, I don't have a good reference on why we should do the 
retry wait even after a successful allocation.


In f2fs/ext4/XFS, memalloc_retry_wait() is only called on page 
allocation failure (i.e., no allocation is done at all), not after each 
bulk allocation unconditionally.

Thus I'm wondering if it's really a policy/common practice to do the 
wait after a successful allocation at all.

In that case, I'm totally fine with only calling the retry wait if we 
are unable to allocate any page, rather than unconditionally.
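E.g. a loop of this shape (just a sketch; the give-up path after
repeated zero-progress passes is omitted here):

	for (allocated = 0; allocated < nr_pages;) {
		unsigned int last = allocated;

		allocated = alloc_pages_bulk_array(gfp, nr_pages, page_array);
		if (allocated == nr_pages)
			return 0;

		/* Only throttle when a full pass made no progress at all. */
		if (allocated == last)
			memalloc_retry_wait(GFP_NOFS);
	}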

Thanks,
Qu
Sweet Tea Dorminy March 25, 2024, 10:54 p.m. UTC | #4
> 
> The reporter is hitting this during compression, while no other 
> major fs (until the recent bcachefs?) supports compression.
> 
> Thus I guess we have a corner case that other fses don't?

Makes sense. And since read success depends on it, it's latency 
sensitive, so I think that could be a justification if we did eventually 
want to throw __GFP_RETRY_MAYFAIL in the flags.

> 
> My question is, I don't have a good reference on why we should do the 
> retry wait even after a successful allocation.
> 
> In f2fs/ext4/XFS, memalloc_retry_wait() is only called on page 
> allocation failure (i.e., no allocation is done at all), not after each 
> bulk allocation unconditionally.
> 
> Thus I'm wondering if it's really a policy/common practice to do the 
> wait after a successful allocation at all.

You make a good point. 
https://elixir.bootlin.com/linux/latest/source/include/linux/sched/mm.h#L267 
seems to indicate one should use it on every retry of mem allocation, 
but in practice it does appear that it's only done for a complete 
failure to allocate.

> 
> In that case, I'm totally fine with only calling the retry wait if we 
> are unable to allocate any page, rather than unconditionally.
> 

Okay, I agree that that's a good plan and appreciate your discussion. I 
don't know if it's worth giving-up-and-freeing if we fail to make 
progress twice, but I don't think it matters.

Thank you very much!

Sweet Tea

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7441245b1ceb..d49e7f0384ed 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -681,33 +681,22 @@  static void end_bbio_data_read(struct btrfs_bio *bbio)
 int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
 			   gfp_t extra_gfp)
 {
+	const gfp_t gfp = GFP_NOFS | extra_gfp;
 	unsigned int allocated;
 
-	for (allocated = 0; allocated < nr_pages;) {
-		unsigned int last = allocated;
-
-		allocated = alloc_pages_bulk_array(GFP_NOFS | extra_gfp,
-						   nr_pages, page_array);
-
-		if (allocated == nr_pages)
-			return 0;
-
-		/*
-		 * During this iteration, no page could be allocated, even
-		 * though alloc_pages_bulk_array() falls back to alloc_page()
-		 * if  it could not bulk-allocate. So we must be out of memory.
-		 */
-		if (allocated == last) {
-			for (int i = 0; i < allocated; i++) {
-				__free_page(page_array[i]);
-				page_array[i] = NULL;
-			}
-			return -ENOMEM;
-		}
-
-		memalloc_retry_wait(GFP_NOFS);
+	allocated = alloc_pages_bulk_array(GFP_NOFS | gfp, nr_pages, page_array);
+	for (; allocated < nr_pages; allocated++) {
+		page_array[allocated] = alloc_page(gfp);
+		if (unlikely(!page_array[allocated]))
+			goto enomem;
 	}
 	return 0;
+enomem:
+	for (int i = 0; i < allocated; i++) {
+		__free_page(page_array[i]);
+		page_array[i] = NULL;
+	}
+	return -ENOMEM;
 }
 
 /*