diff mbox series

[7/9] btrfs: drop NOFAIL from set_extent_bit allocation masks

Message ID 232abb666a6901f909aeb21dc6f5998f0250e073.1684967923.git.dsterba@suse.com (mailing list archive)
State New, archived
Headers show
Series Parameter cleanups in extent state helpers | expand

Commit Message

David Sterba May 24, 2023, 11:04 p.m. UTC
The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
(commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata
reservation")), without any explanation why it would be needed.

Meanwhile we've updated the semantics of set_extent_bit to handle failed
allocations and do unlock, sleep and retry if needed.  The use of the
NOFAIL flag is also an outlier, we never want any of the set/clear
extent bit helpers to fail, they're used for many critical changes like
extent locking, besides the extent state bit changes.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/block-group.c | 3 +--
 fs/btrfs/extent-tree.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig May 25, 2023, 9:06 a.m. UTC | #1
On Thu, May 25, 2023 at 01:04:34AM +0200, David Sterba wrote:
> The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
> (commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata
> reservation")), without any explanation why it would be needed.
> 
> Meanwhile we've updated the semantics of set_extent_bit to handle failed
> allocations and do unlock, sleep and retry if needed.  The use of the
> NOFAIL flag is also an outlier, we never want any of the set/clear
> extent bit helpers to fail, they're used for many critical changes like
> extent locking, besides the extent state bit changes.

Given how many of the callers do not check the return value, and that
the trees store essential information, I think the right thing here
is to always use __GFP_NOFAIL unless GFP_NOWAIT is passed.  In practice
this won't make a difference as currently small slab allocations never
fail, but that's an undocumented assumption.
Qu Wenruo May 25, 2023, 10:38 a.m. UTC | #2
On 2023/5/25 07:04, David Sterba wrote:
> The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
> (commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata
> reservation")), without any explanation why it would be needed.
>
> Meanwhile we've updated the semantics of set_extent_bit to handle failed
> allocations and do unlock, sleep and retry if needed.  The use of the
> NOFAIL flag is also an outlier, we never want any of the set/clear
> extent bit helpers to fail, they're used for many critical changes like
> extent locking, besides the extent state bit changes.

As I mentioned in the cover letter, if we really want to set/clear bits
to not fail, and can accept the extra memory usage (as high as twice the
number of extent states), then we should really consider the following
changes:

- Introduce hole extent_state
   For ranges without any bit set, there still needs to be an
   extent_state.

- Make callers to pre-allocate 2 extent_state and pass them as mandatory
   parameters to set/clear bits

- Make set/clear bits to use the preallocated 2 extent states

By this, we should be able to completely get rid of the memory
allocation inside the extent io tree set/clear calls.

Thanks,
Qu
>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>   fs/btrfs/block-group.c | 3 +--
>   fs/btrfs/extent-tree.c | 3 +--
>   2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index ec463f8d83ec..202e2aa949c5 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -3523,8 +3523,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
>
>   			set_extent_bit(&trans->transaction->pinned_extents,
>   				       bytenr, bytenr + num_bytes - 1,
> -				       EXTENT_DIRTY, NULL,
> -				       GFP_NOFS | __GFP_NOFAIL);
> +				       EXTENT_DIRTY, NULL, GFP_NOFS);
>   		}
>
>   		spin_lock(&trans->transaction->dirty_bgs_lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 03b2a7c508b9..6e319100e3a3 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2508,8 +2508,7 @@ static int pin_down_extent(struct btrfs_trans_handle *trans,
>   	spin_unlock(&cache->space_info->lock);
>
>   	set_extent_bit(&trans->transaction->pinned_extents, bytenr,
> -		       bytenr + num_bytes - 1, EXTENT_DIRTY, NULL,
> -		       GFP_NOFS | __GFP_NOFAIL);
> +		       bytenr + num_bytes - 1, EXTENT_DIRTY, NULL, GFP_NOFS);
>   	return 0;
>   }
>
David Sterba May 25, 2023, 11 p.m. UTC | #3
On Thu, May 25, 2023 at 06:38:51PM +0800, Qu Wenruo wrote:
> On 2023/5/25 07:04, David Sterba wrote:
> > The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
> > (commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata
> > reservation")), without any explanation why it would be needed.
> >
> > Meanwhile we've updated the semantics of set_extent_bit to handle failed
> > allocations and do unlock, sleep and retry if needed.  The use of the
> > NOFAIL flag is also an outlier, we never want any of the set/clear
> > extent bit helpers to fail, they're used for many critical changes like
> > extent locking, besides the extent state bit changes.
> 
> As I mentioned in the cover letter, if we really want to set/clear bits
> to not fail, and can accept the extra memory usage (as high as twice the
> number of extent states), then we should really consider the following
> changes:
> 
> - Introduce hole extent_state
>    For ranges without any bit set, there still needs to be an
>    extent_state.
> 
> - Make callers to pre-allocate 2 extent_state and pass them as mandatory
>    parameters to set/clear bits
> 
> - Make set/clear bits to use the preallocated 2 extent states
> 
> By this, we should be able to completely get rid of the memory
> allocation inside the extent io tree set/clear calls.

I did not understand what you mean from the cover letter comment but now
I see it. The size of extent_state is 72 bytes on release build, which
is not that much, doubling that size is still in acceptable range even
if we'd never utilize the preallocated memory at some point. That we
dont't see any failures now is 1) GFP_NOFS does not fail 2) we're using
a slab cache and the objects get reused within the same allocated space
due to frequent state changes. Even if 1) wasn't true we'd still have to
hit a hard memory allocation error, i.e. no available pages for a slab
extension.

If the preallocation prevents any failure due to memory allocations then
we can keep the current way of dealing with extent bit change error
handling, i.e. none because we rely on the allocator.
David Sterba May 25, 2023, 11:25 p.m. UTC | #4
On Thu, May 25, 2023 at 02:06:05AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 01:04:34AM +0200, David Sterba wrote:
> > The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
> > (commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata
> > reservation")), without any explanation why it would be needed.
> > 
> > Meanwhile we've updated the semantics of set_extent_bit to handle failed
> > allocations and do unlock, sleep and retry if needed.  The use of the
> > NOFAIL flag is also an outlier, we never want any of the set/clear
> > extent bit helpers to fail, they're used for many critical changes like
> > extent locking, besides the extent state bit changes.
> 
> Given how many of the callers do not check the return value, and that
> the trees store essential information, I think the right thing here
> is to always use __GFP_NOFAIL unless GFP_NOWAIT is passed.  In practice
> this won't make a difference as currently small slab allocations never
> fail, but that's an undocumented assumption.

Yeah, that "GFP_NOFS does not fail for < costly allocations" has been a
deal with the allocator but there was some pressure to stop doing that
eventually. Discussed at LSF/MM, __GFP_NOFAIL should be replaced by
proper error handling if possible and minimized. I'm not sure if adding
it for something that frequently done as the state bit changes is a good
idea wrt memory management but I agree there's practically no change how
things currently work.

Qu suggested another way how we could preallocate some memory that would
be used for extent holes and then reused for the cases where we need to
allocate to do insert/split. I'd like to try this approach first and also
check with MM guys that NOFAIL in this case is acceptable so we have
more options to choose from.
Qu Wenruo May 25, 2023, 11:26 p.m. UTC | #5
On 2023/5/26 07:00, David Sterba wrote:
> On Thu, May 25, 2023 at 06:38:51PM +0800, Qu Wenruo wrote:
>> On 2023/5/25 07:04, David Sterba wrote:
>>> The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010
>>> (commit f0486c68e4bd9a ("Btrfs: Introduce contexts for metadata
>>> reservation")), without any explanation why it would be needed.
>>>
>>> Meanwhile we've updated the semantics of set_extent_bit to handle failed
>>> allocations and do unlock, sleep and retry if needed.  The use of the
>>> NOFAIL flag is also an outlier, we never want any of the set/clear
>>> extent bit helpers to fail, they're used for many critical changes like
>>> extent locking, besides the extent state bit changes.
>>
>> As I mentioned in the cover letter, if we really want to set/clear bits
>> to not fail, and can accept the extra memory usage (as high as twice the
>> number of extent states), then we should really consider the following
>> changes:
>>
>> - Introduce hole extent_state
>>     For ranges without any bit set, there still needs to be an
>>     extent_state.
>>
>> - Make callers to pre-allocate 2 extent_state and pass them as mandatory
>>     parameters to set/clear bits
>>
>> - Make set/clear bits to use the preallocated 2 extent states
>>
>> By this, we should be able to completely get rid of the memory
>> allocation inside the extent io tree set/clear calls.
>
> I did not understand what you mean from the cover letter comment but now
> I see it. The size of extent_state is 72 bytes on release build, which
> is not that much, doubling that size is still in acceptable range even
> if we'd never utilize the preallocated memory at some point. That we
> dont't see any failures now is 1) GFP_NOFS does not fail 2) we're using
> a slab cache and the objects get reused within the same allocated space
> due to frequent state changes. Even if 1) wasn't true we'd still have to
> hit a hard memory allocation error, i.e. no available pages for a slab
> extension.

Well, I still remember an internal bugzilla that openSUSE on RPI4
triggered failure at the extent_state allocation.

>
> If the preallocation prevents any failure due to memory allocations then
> we can keep the current way of dealing with extent bit change error
> handling, i.e. none because we rely on the allocator.

The preallocation is not going to help much after the first allocation
is being used.

After that, all later states are allocated using GFP_ATOMIC, which has a
much smaller pool to fulfill from.
And considering how many call sites don't check the return value at all,
we'd either go forward changing all call sites, or really make the
set/clear bits bulletproof (and only fail out of the set/clear call).

Thanks,
Qu
diff mbox series

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index ec463f8d83ec..202e2aa949c5 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3523,8 +3523,7 @@  int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 
 			set_extent_bit(&trans->transaction->pinned_extents,
 				       bytenr, bytenr + num_bytes - 1,
-				       EXTENT_DIRTY, NULL,
-				       GFP_NOFS | __GFP_NOFAIL);
+				       EXTENT_DIRTY, NULL, GFP_NOFS);
 		}
 
 		spin_lock(&trans->transaction->dirty_bgs_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 03b2a7c508b9..6e319100e3a3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2508,8 +2508,7 @@  static int pin_down_extent(struct btrfs_trans_handle *trans,
 	spin_unlock(&cache->space_info->lock);
 
 	set_extent_bit(&trans->transaction->pinned_extents, bytenr,
-		       bytenr + num_bytes - 1, EXTENT_DIRTY, NULL,
-		       GFP_NOFS | __GFP_NOFAIL);
+		       bytenr + num_bytes - 1, EXTENT_DIRTY, NULL, GFP_NOFS);
 	return 0;
 }