diff mbox series

btrfs: retry flushing for del_balance_item() if the transaction is interrupted

Message ID 77ec19769e75c704cb260b98b41e33340a51c40c.1692181669.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: retry flushing for del_balance_item() if the transaction is interrupted | expand

Commit Message

Qu Wenruo Aug. 16, 2023, 10:28 a.m. UTC
[BUG]

There is an internal bug report that there are only 3 lines of btrfs
errors, then btrfs falls read-only:

 [358958.022131] BTRFS info (device dm-9): balance: canceled
 [358958.022148] BTRFS: error (device dm-9) in __cancel_balance:4014: errno=-4 unknown
 [358958.022150] BTRFS info (device dm-9): forced readonly

[CAUSE]
The error number -4 is -EINTR, and according to the code line (although
backported kernel, the code is still relevant upstream), it's the
btrfs_handle_fs_error() call inside reset_balance_state().

This can happen when we try to start a transaction which requires
metadata flushing.

This metadata flushing can be interrupted by signal, thus it can return
-EINTR.

For our case, the -EINTR is deadly because we don't handle the error at
all, and immediately mark the fs read-only in the following call chain:

reset_balance_state()
|- del_balance_item()
|  `- btrfs_start_transation_fallback_global_rsv()
|     `- start_transaction()
|	 `- btrfs_block_rsv_add()
|	    `- __reserve_bytes()
|	       `- handle_reserve_ticket()
|		  `- wait_reserve_ticket()
|		     `- prepare_to_wait_event()
|			This wait has TASK_KILLABLE, thus can be
|			interrupted.
|			Thus we return -EINTR.
|
|- IS_ERR(trans) triggered
|- btrfs_handle_fs_error()
   The fs is marked read-only.

[FIX]
For this particular call site, we can not afford just erroring out with
-EINTR.

This patch would fix the error by retry until either we got a valid
transaction handle, or got an error other than -EINTR.

Since we're here, also enhance the error message a little to make it
more readable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Filipe Manana Aug. 16, 2023, 12:45 p.m. UTC | #1
On Wed, Aug 16, 2023 at 06:28:16PM +0800, Qu Wenruo wrote:
> [BUG]
> 
> There is an internal bug report that there are only 3 lines of btrfs
> errors, then btrfs falls read-only:
> 
>  [358958.022131] BTRFS info (device dm-9): balance: canceled
>  [358958.022148] BTRFS: error (device dm-9) in __cancel_balance:4014: errno=-4 unknown
>  [358958.022150] BTRFS info (device dm-9): forced readonly
> 
> [CAUSE]
> The error number -4 is -EINTR, and according to the code line (although
> backported kernel, the code is still relevant upstream), it's the
> btrfs_handle_fs_error() call inside reset_balance_state().
> 
> This can happen when we try to start a transaction which requires
> metadata flushing.
> 
> This metadata flushing can be interrupted by signal, thus it can return
> -EINTR.
> 
> For our case, the -EINTR is deadly because we don't handle the error at
> all, and immediately mark the fs read-only in the following call chain:
> 
> reset_balance_state()
> |- del_balance_item()
> |  `- btrfs_start_transation_fallback_global_rsv()
> |     `- start_transaction()
> |	 `- btrfs_block_rsv_add()
> |	    `- __reserve_bytes()
> |	       `- handle_reserve_ticket()
> |		  `- wait_reserve_ticket()
> |		     `- prepare_to_wait_event()
> |			This wait has TASK_KILLABLE, thus can be
> |			interrupted.
> |			Thus we return -EINTR.
> |
> |- IS_ERR(trans) triggered
> |- btrfs_handle_fs_error()
>    The fs is marked read-only.
> 
> [FIX]
> For this particular call site, we can not afford just erroring out with
> -EINTR.
> 
> This patch would fix the error by retry until either we got a valid
> transaction handle, or got an error other than -EINTR.
> 
> Since we're here, also enhance the error message a little to make it
> more readable.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/volumes.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 189da583bb67..e83711fe31bb 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3507,7 +3507,15 @@ static int del_balance_item(struct btrfs_fs_info *fs_info)
>  	if (!path)
>  		return -ENOMEM;
>  
> -	trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
> +	do {
> +		/*
> +		 * The transaction starting here can be interrupted, but if we
> +		 * just error out we would mark the fs read-only.
> +		 * Thus here we try to start the transaction again if it's
> +		 * interrupted.
> +		 */
> +		trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
> +	} while (IS_ERR(trans) && PTR_ERR(trans) == -EINTR);

This condition can be simply:  trans == ERR_PTR(-EINTR)

My only concern is if this can turn into an infinite loop due to a high enough rate of
signals being sent to the process...

Instead of this I would make reset_balance_state() just print a warning, and not
call btrfs_handle_fs_error()  and then change insert_balance_item() to not fail in
case the item already exists - instead just overwrite it.

Thanks.


>  	if (IS_ERR(trans)) {
>  		btrfs_free_path(path);
>  		return PTR_ERR(trans);
> @@ -3594,7 +3602,7 @@ static void reset_balance_state(struct btrfs_fs_info *fs_info)
>  	kfree(bctl);
>  	ret = del_balance_item(fs_info);
>  	if (ret)
> -		btrfs_handle_fs_error(fs_info, ret, NULL);
> +		btrfs_handle_fs_error(fs_info, ret, "failed to delete balance item");
>  }
>  
>  /*
> -- 
> 2.41.0
>
Qu Wenruo Aug. 16, 2023, 9:54 p.m. UTC | #2
On 2023/8/16 20:45, Filipe Manana wrote:
> On Wed, Aug 16, 2023 at 06:28:16PM +0800, Qu Wenruo wrote:
>> [BUG]
>>
>> There is an internal bug report that there are only 3 lines of btrfs
>> errors, then btrfs falls read-only:
>>
>>   [358958.022131] BTRFS info (device dm-9): balance: canceled
>>   [358958.022148] BTRFS: error (device dm-9) in __cancel_balance:4014: errno=-4 unknown
>>   [358958.022150] BTRFS info (device dm-9): forced readonly
>>
>> [CAUSE]
>> The error number -4 is -EINTR, and according to the code line (although
>> backported kernel, the code is still relevant upstream), it's the
>> btrfs_handle_fs_error() call inside reset_balance_state().
>>
>> This can happen when we try to start a transaction which requires
>> metadata flushing.
>>
>> This metadata flushing can be interrupted by signal, thus it can return
>> -EINTR.
>>
>> For our case, the -EINTR is deadly because we don't handle the error at
>> all, and immediately mark the fs read-only in the following call chain:
>>
>> reset_balance_state()
>> |- del_balance_item()
>> |  `- btrfs_start_transation_fallback_global_rsv()
>> |     `- start_transaction()
>> |	 `- btrfs_block_rsv_add()
>> |	    `- __reserve_bytes()
>> |	       `- handle_reserve_ticket()
>> |		  `- wait_reserve_ticket()
>> |		     `- prepare_to_wait_event()
>> |			This wait has TASK_KILLABLE, thus can be
>> |			interrupted.
>> |			Thus we return -EINTR.
>> |
>> |- IS_ERR(trans) triggered
>> |- btrfs_handle_fs_error()
>>     The fs is marked read-only.
>>
>> [FIX]
>> For this particular call site, we can not afford just erroring out with
>> -EINTR.
>>
>> This patch would fix the error by retry until either we got a valid
>> transaction handle, or got an error other than -EINTR.
>>
>> Since we're here, also enhance the error message a little to make it
>> more readable.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/volumes.c | 12 ++++++++++--
>>   1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 189da583bb67..e83711fe31bb 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3507,7 +3507,15 @@ static int del_balance_item(struct btrfs_fs_info *fs_info)
>>   	if (!path)
>>   		return -ENOMEM;
>>
>> -	trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
>> +	do {
>> +		/*
>> +		 * The transaction starting here can be interrupted, but if we
>> +		 * just error out we would mark the fs read-only.
>> +		 * Thus here we try to start the transaction again if it's
>> +		 * interrupted.
>> +		 */
>> +		trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
>> +	} while (IS_ERR(trans) && PTR_ERR(trans) == -EINTR);
>
> This condition can be simply:  trans == ERR_PTR(-EINTR)
>
> My only concern is if this can turn into an infinite loop due to a high enough rate of
> signals being sent to the process...

Yep, that's indeed a concern.

The other solution is to introduce a flag to disallow signal for the
ticket system (aka non-killable wait), which can get rid of the frequent
signal problems.

In fact, we may not want certain reclaim to be interrupted at all,
especially for BTRFS_RESERVE_FLUSH_ALL_STEAL, which are only utilized
for very critical operations like unlink and other deletion operations.

>
> Instead of this I would make reset_balance_state() just print a warning, and not
> call btrfs_handle_fs_error()  and then change insert_balance_item() to not fail in
> case the item already exists - instead just overwrite it.

This means, if a unlucky interruption happened, the left balance item
can cause us to resume a balance on the next mount, which can be
unexpected for the end user.

Thanks,
Qu
>
> Thanks.
>
>
>>   	if (IS_ERR(trans)) {
>>   		btrfs_free_path(path);
>>   		return PTR_ERR(trans);
>> @@ -3594,7 +3602,7 @@ static void reset_balance_state(struct btrfs_fs_info *fs_info)
>>   	kfree(bctl);
>>   	ret = del_balance_item(fs_info);
>>   	if (ret)
>> -		btrfs_handle_fs_error(fs_info, ret, NULL);
>> +		btrfs_handle_fs_error(fs_info, ret, "failed to delete balance item");
>>   }
>>
>>   /*
>> --
>> 2.41.0
>>
Filipe Manana Aug. 17, 2023, 7:39 a.m. UTC | #3
On Wed, Aug 16, 2023 at 10:54 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2023/8/16 20:45, Filipe Manana wrote:
> > On Wed, Aug 16, 2023 at 06:28:16PM +0800, Qu Wenruo wrote:
> >> [BUG]
> >>
> >> There is an internal bug report that there are only 3 lines of btrfs
> >> errors, then btrfs falls read-only:
> >>
> >>   [358958.022131] BTRFS info (device dm-9): balance: canceled
> >>   [358958.022148] BTRFS: error (device dm-9) in __cancel_balance:4014: errno=-4 unknown
> >>   [358958.022150] BTRFS info (device dm-9): forced readonly
> >>
> >> [CAUSE]
> >> The error number -4 is -EINTR, and according to the code line (although
> >> backported kernel, the code is still relevant upstream), it's the
> >> btrfs_handle_fs_error() call inside reset_balance_state().
> >>
> >> This can happen when we try to start a transaction which requires
> >> metadata flushing.
> >>
> >> This metadata flushing can be interrupted by signal, thus it can return
> >> -EINTR.
> >>
> >> For our case, the -EINTR is deadly because we don't handle the error at
> >> all, and immediately mark the fs read-only in the following call chain:
> >>
> >> reset_balance_state()
> >> |- del_balance_item()
> >> |  `- btrfs_start_transation_fallback_global_rsv()
> >> |     `- start_transaction()
> >> |     `- btrfs_block_rsv_add()
> >> |        `- __reserve_bytes()
> >> |           `- handle_reserve_ticket()
> >> |              `- wait_reserve_ticket()
> >> |                 `- prepare_to_wait_event()
> >> |                    This wait has TASK_KILLABLE, thus can be
> >> |                    interrupted.
> >> |                    Thus we return -EINTR.
> >> |
> >> |- IS_ERR(trans) triggered
> >> |- btrfs_handle_fs_error()
> >>     The fs is marked read-only.
> >>
> >> [FIX]
> >> For this particular call site, we can not afford just erroring out with
> >> -EINTR.
> >>
> >> This patch would fix the error by retry until either we got a valid
> >> transaction handle, or got an error other than -EINTR.
> >>
> >> Since we're here, also enhance the error message a little to make it
> >> more readable.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>   fs/btrfs/volumes.c | 12 ++++++++++--
> >>   1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >> index 189da583bb67..e83711fe31bb 100644
> >> --- a/fs/btrfs/volumes.c
> >> +++ b/fs/btrfs/volumes.c
> >> @@ -3507,7 +3507,15 @@ static int del_balance_item(struct btrfs_fs_info *fs_info)
> >>      if (!path)
> >>              return -ENOMEM;
> >>
> >> -    trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
> >> +    do {
> >> +            /*
> >> +             * The transaction starting here can be interrupted, but if we
> >> +             * just error out we would mark the fs read-only.
> >> +             * Thus here we try to start the transaction again if it's
> >> +             * interrupted.
> >> +             */
> >> +            trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
> >> +    } while (IS_ERR(trans) && PTR_ERR(trans) == -EINTR);
> >
> > This condition can be simply:  trans == ERR_PTR(-EINTR)
> >
> > My only concern is if this can turn into an infinite loop due to a high enough rate of
> > signals being sent to the process...
>
> Yep, that's indeed a concern.
>
> The other solution is to introduce a flag to disallow signal for the
> ticket system (aka non-killable wait), which can get rid of the frequent
> signal problems.
>
> In fact, we may not want certain reclaim to be interrupted at all,
> especially for BTRFS_RESERVE_FLUSH_ALL_STEAL, which are only utilized
> for very critical operations like unlink and other deletion operations.

We shouldn't need to call
btrfs_start_transaction_fallback_global_rsv() because we don't need to
reserve space.
A join transaction would be enough, because:

1) We pass 0 items to the start transaction call;

2) More importantly, we are updating the root tree and the root tree
uses the global block reserve, see btrfs_init_root_block_rsv().

So the start transaction call should not be reserving any space
because "num_items == 0" and "flush != BTRFS_RESERVE_FLUSH_ALL".
Are you sure the -EINTR is coming from the
btrfs_start_transaction_fallback_global_rsv() and not from the
btrfs_search_slot() call at del_balance_item() for example?

Nothing in the partial log you pasted can tell the -EINTR comes from
btrfs_start_transaction_fallback_global_rsv(), which would be very
surprising
because it's not reserving any space for those reasons mentioned above.

>
> >
> > Instead of this I would make reset_balance_state() just print a warning, and not
> > call btrfs_handle_fs_error()  and then change insert_balance_item() to not fail in
> > case the item already exists - instead just overwrite it.
>
> This means, if a unlucky interruption happened, the left balance item
> can cause us to resume a balance on the next mount, which can be
> unexpected for the end user.

Yes, but maybe not much work is done unless after that some block
groups got fragmented enough to trigger the stored balance filters in
the item.
The worst case is without filters, where all block groups are always relocated.
Not ideal, yes.

But having had a closer look, my concern is that I don't see how
btrfs_start_transaction_fallback_global_rsv() can return -EINTR, and I
suspect
it comes from somewhere else.

Thanks.

>
> Thanks,
> Qu
> >
> > Thanks.
> >
> >
> >>      if (IS_ERR(trans)) {
> >>              btrfs_free_path(path);
> >>              return PTR_ERR(trans);
> >> @@ -3594,7 +3602,7 @@ static void reset_balance_state(struct btrfs_fs_info *fs_info)
> >>      kfree(bctl);
> >>      ret = del_balance_item(fs_info);
> >>      if (ret)
> >> -            btrfs_handle_fs_error(fs_info, ret, NULL);
> >> +            btrfs_handle_fs_error(fs_info, ret, "failed to delete balance item");
> >>   }
> >>
> >>   /*
> >> --
> >> 2.41.0
> >>
Qu Wenruo Aug. 17, 2023, 8:04 a.m. UTC | #4
On 2023/8/17 15:39, Filipe Manana wrote:
> On Wed, Aug 16, 2023 at 10:54 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
[...]
>>>
>>> My only concern is if this can turn into an infinite loop due to a high enough rate of
>>> signals being sent to the process...
>>
>> Yep, that's indeed a concern.
>>
>> The other solution is to introduce a flag to disallow signal for the
>> ticket system (aka non-killable wait), which can get rid of the frequent
>> signal problems.
>>
>> In fact, we may not want certain reclaim to be interrupted at all,
>> especially for BTRFS_RESERVE_FLUSH_ALL_STEAL, which are only utilized
>> for very critical operations like unlink and other deletion operations.
>
> We shouldn't need to call
> btrfs_start_transaction_fallback_global_rsv() because we don't need to
> reserve space.
> A join transaction would be enough, because:
>
> 1) We pass 0 items to the start transaction call;
>
> 2) More importantly, we are updating the root tree and the root tree
> uses the global block reserve, see btrfs_init_root_block_rsv().
>
> So the start transaction call should not be reserving any space
> because "num_items == 0" and "flush != BTRFS_RESERVE_FLUSH_ALL".

My bad, this is caused by the difference between SLE and the upstream.

For upstream, we go BTRFS_RESERVE_FLUSH_ALL_STEAL for this call site
(btrfs_start_transaction_fallback_global_rsv()), but for the SLE kernel
it's just a plain btrfs_start_transaction() which is going to need to do
block rsv.

I should just backport the patch changing the start transaction helper.

And indeed your analyse is correct, for upstream kernel we won't flush
for btrfs_start_transaction_fallback_global_rsv(), thus we don't need to
handle it at all.

> Are you sure the -EINTR is coming from the
> btrfs_start_transaction_fallback_global_rsv() and not from the
> btrfs_search_slot() call at del_balance_item() for example?

Just to fill my curiosity, I checked the tree read code which waits
using UNINTERRUPTIBLE, but it's no longer relevant anymore...

Thanks for pointing out the hole!
Qu

>
> Nothing in the partial log you pasted can tell the -EINTR comes from
> btrfs_start_transaction_fallback_global_rsv(), which would be very
> surprising
> because it's not reserving any space for those reasons mentioned above.
>
>>
>>>
>>> Instead of this I would make reset_balance_state() just print a warning, and not
>>> call btrfs_handle_fs_error()  and then change insert_balance_item() to not fail in
>>> case the item already exists - instead just overwrite it.
>>
>> This means, if a unlucky interruption happened, the left balance item
>> can cause us to resume a balance on the next mount, which can be
>> unexpected for the end user.
>
> Yes, but maybe not much work is done unless after that some block
> groups got fragmented enough to trigger the stored balance filters in
> the item.
> The worst case is without filters, where all block groups are always relocated.
> Not ideal, yes.
>
> But having had a closer look, my concern is that I don't see how
> btrfs_start_transaction_fallback_global_rsv() can return -EINTR, and I
> suspect
> it comes from somewhere else.
>
> Thanks.
>
>>
>> Thanks,
>> Qu
>>>
>>> Thanks.
>>>
>>>
>>>>       if (IS_ERR(trans)) {
>>>>               btrfs_free_path(path);
>>>>               return PTR_ERR(trans);
>>>> @@ -3594,7 +3602,7 @@ static void reset_balance_state(struct btrfs_fs_info *fs_info)
>>>>       kfree(bctl);
>>>>       ret = del_balance_item(fs_info);
>>>>       if (ret)
>>>> -            btrfs_handle_fs_error(fs_info, ret, NULL);
>>>> +            btrfs_handle_fs_error(fs_info, ret, "failed to delete balance item");
>>>>    }
>>>>
>>>>    /*
>>>> --
>>>> 2.41.0
>>>>
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 189da583bb67..e83711fe31bb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3507,7 +3507,15 @@  static int del_balance_item(struct btrfs_fs_info *fs_info)
 	if (!path)
 		return -ENOMEM;
 
-	trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
+	do {
+		/*
+		 * The transaction starting here can be interrupted, but if we
+		 * just error out we would mark the fs read-only.
+		 * Thus here we try to start the transaction again if it's
+		 * interrupted.
+		 */
+		trans = btrfs_start_transaction_fallback_global_rsv(root, 0);
+	} while (IS_ERR(trans) && PTR_ERR(trans) == -EINTR);
 	if (IS_ERR(trans)) {
 		btrfs_free_path(path);
 		return PTR_ERR(trans);
@@ -3594,7 +3602,7 @@  static void reset_balance_state(struct btrfs_fs_info *fs_info)
 	kfree(bctl);
 	ret = del_balance_item(fs_info);
 	if (ret)
-		btrfs_handle_fs_error(fs_info, ret, NULL);
+		btrfs_handle_fs_error(fs_info, ret, "failed to delete balance item");
 }
 
 /*