diff mbox series

[1/5] btrfs: check rw_devices, not num_devices for restriping

Message ID 20200110161128.21710-2-josef@toxicpanda.com (mailing list archive)
State New, archived
Headers show
Series clean up how we mark block groups read only | expand

Commit Message

Josef Bacik Jan. 10, 2020, 4:11 p.m. UTC
While running xfstests with compression on I noticed I was panicing on
btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
was strange.

What was happening is with my patches we now use btrfs_can_overcommit()
to see if we can flip a block group read only.  Before this would fail
because we weren't taking into account the usable un-allocated space for
allocating chunks.  With my patches we were allowed to do the balance,
which is technically correct.

However this test is testing restriping with a degraded mount, something
that isn't working right because Anand's fix for the test was never
actually merged.

So now we're trying to allocate a chunk and cannot because we want to
allocate a RAID1 chunk, but there's only 1 device that's available for
usage.  This results in an ENOSPC in one of the BUG_ON(ret) paths in
relocation (and a tricky path that is going to take many more patches to
fix.)

But we shouldn't even be making it this far, we don't have enough
devices to restripe.  The problem is we're using btrfs_num_devices(),
which for some reason includes missing devices.  That's not actually
what we want, we want the rw_devices.

Fix this by getting the rw_devices.  With this patch we're no longer
panicing with my other patches applied, and we're in fact erroring out
at the correct spot instead of at inc_block_group_ro.  The fact that
this was working before was just sheer dumb luck.

Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/volumes.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Qu Wenruo Jan. 11, 2020, 9:24 a.m. UTC | #1
On 2020/1/11 上午12:11, Josef Bacik wrote:
> While running xfstests with compression on I noticed I was panicing on
> btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
> was strange.
> 
> What was happening is with my patches we now use btrfs_can_overcommit()
> to see if we can flip a block group read only.  Before this would fail
> because we weren't taking into account the usable un-allocated space for
> allocating chunks.  With my patches we were allowed to do the balance,
> which is technically correct.
> 
> However this test is testing restriping with a degraded mount, something
> that isn't working right because Anand's fix for the test was never
> actually merged.
> 
> So now we're trying to allocate a chunk and cannot because we want to
> allocate a RAID1 chunk, but there's only 1 device that's available for
> usage.  This results in an ENOSPC in one of the BUG_ON(ret) paths in
> relocation (and a tricky path that is going to take many more patches to
> fix.)
> 
> But we shouldn't even be making it this far, we don't have enough
> devices to restripe.  The problem is we're using btrfs_num_devices(),
> which for some reason includes missing devices.  That's not actually
> what we want, we want the rw_devices.
> 
> Fix this by getting the rw_devices.  With this patch we're no longer
> panicing with my other patches applied, and we're in fact erroring out
> at the correct spot instead of at inc_block_group_ro.  The fact that
> this was working before was just sheer dumb luck.
> 
> Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/volumes.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7483521a928b..a92059555754 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3881,7 +3881,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>  		}
>  	}
>  
> -	num_devices = btrfs_num_devices(fs_info);
> +	/*
> +	 * rw_devices can be messed with by rm_device and device replace, so
> +	 * take the chunk_mutex to make sure we have a relatively consistent
> +	 * view of the fs at this point.
> +	 */
> +	mutex_lock(&fs_info->chunk_mutex);
> +	num_devices = fs_info->fs_devices->rw_devices;
> +	mutex_unlock(&fs_info->chunk_mutex);

chunk_mutex is the correct lock for rw_devices counter and alloc_list.
So,

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

>  
>  	/*
>  	 * SINGLE profile on-disk has no profile bit, but in-memory we have a
>
David Sterba Jan. 14, 2020, 8:56 p.m. UTC | #2
On Fri, Jan 10, 2020 at 11:11:24AM -0500, Josef Bacik wrote:
> While running xfstests with compression on I noticed I was panicing on
> btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
> was strange.

Do you have stacktrace of the panic?

> What was happening is with my patches we now use btrfs_can_overcommit()
> to see if we can flip a block group read only.  Before this would fail
> because we weren't taking into account the usable un-allocated space for
> allocating chunks.  With my patches we were allowed to do the balance,
> which is technically correct.

What patches does "my patches" mean?

> However this test is testing restriping with a degraded mount, something
> that isn't working right because Anand's fix for the test was never
> actually merged.

Which patch is that?

> So now we're trying to allocate a chunk and cannot because we want to
> allocate a RAID1 chunk, but there's only 1 device that's available for
> usage.  This results in an ENOSPC in one of the BUG_ON(ret) paths in
> relocation (and a tricky path that is going to take many more patches to
> fix.)
> 
> But we shouldn't even be making it this far, we don't have enough
> devices to restripe.  The problem is we're using btrfs_num_devices(),
> which for some reason includes missing devices.  That's not actually
> what we want, we want the rw_devices.

The wrapper btrfs_num_devices takes into account an ongoing replace that
temporarily increases num_devices, so the result returned to balance is
adjusted.

That we need to know the correct number of writable devices at this
point is right. With btrfs_num_devices we'd have to subtract missing
devices, but in the end we can't use more than rw_devices.

> Fix this by getting the rw_devices.  With this patch we're no longer
> panicing with my other patches applied, and we're in fact erroring out
> at the correct spot instead of at inc_block_group_ro.  The fact that
> this was working before was just sheer dumb luck.
> 
> Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/volumes.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7483521a928b..a92059555754 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3881,7 +3881,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>  		}
>  	}
>  
> -	num_devices = btrfs_num_devices(fs_info);
> +	/*
> +	 * rw_devices can be messed with by rm_device and device replace, so
> +	 * take the chunk_mutex to make sure we have a relatively consistent
> +	 * view of the fs at this point.

Well, what does 'relatively consistent' mean here? There are enough
locks and exclusion that device remove or replace should not change the
value until btrfs_balance ends, no?

> +	 */
> +	mutex_lock(&fs_info->chunk_mutex);
> +	num_devices = fs_info->fs_devices->rw_devices;
> +	mutex_unlock(&fs_info->chunk_mutex);
>  
>  	/*
>  	 * SINGLE profile on-disk has no profile bit, but in-memory we have a
> -- 
> 2.24.1
Josef Bacik Jan. 14, 2020, 9:07 p.m. UTC | #3
On 1/14/20 12:56 PM, David Sterba wrote:
> On Fri, Jan 10, 2020 at 11:11:24AM -0500, Josef Bacik wrote:
>> While running xfstests with compression on I noticed I was panicing on
>> btrfs/154.  I bisected this down to my inc_block_group_ro patches, which
>> was strange.
> 
> Do you have stacktrace of the panic?
> 

I don't have it with me, I can reproduce when I get back.  But it's a 
BUG_ON(ret) in init_reloc_root when we do the copy_root, because we get an 
ENOSPC when trying to allocate the tree block.


>> What was happening is with my patches we now use btrfs_can_overcommit()
>> to see if we can flip a block group read only.  Before this would fail
>> because we weren't taking into account the usable un-allocated space for
>> allocating chunks.  With my patches we were allowed to do the balance,
>> which is technically correct.
> 
> What patches does "my patches" mean?
> 

The ones that convert the inc_block_group_ro() to use btrfs_can_overcommit().

>> However this test is testing restriping with a degraded mount, something
>> that isn't working right because Anand's fix for the test was never
>> actually merged.
> 
> Which patch is that?

It says in the header of btrfs/154.  I don't have xfstests in front of me right now.

> 
>> So now we're trying to allocate a chunk and cannot because we want to
>> allocate a RAID1 chunk, but there's only 1 device that's available for
>> usage.  This results in an ENOSPC in one of the BUG_ON(ret) paths in
>> relocation (and a tricky path that is going to take many more patches to
>> fix.)
>>
>> But we shouldn't even be making it this far, we don't have enough
>> devices to restripe.  The problem is we're using btrfs_num_devices(),
>> which for some reason includes missing devices.  That's not actually
>> what we want, we want the rw_devices.
> 
> The wrapper btrfs_num_devices takes into account an ongoing replace that
> temporarily increases num_devices, so the result returned to balance is
> adjusted.
> 
> That we need to know the correct number of writable devices at this
> point is right. With btrfs_num_devices we'd have to subtract missing
> devices, but in the end we can't use more than rw_devices.
> 
>> Fix this by getting the rw_devices.  With this patch we're no longer
>> panicing with my other patches applied, and we're in fact erroring out
>> at the correct spot instead of at inc_block_group_ro.  The fact that
>> this was working before was just sheer dumb luck.
>>
>> Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/volumes.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 7483521a928b..a92059555754 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3881,7 +3881,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>>   		}
>>   	}
>>   
>> -	num_devices = btrfs_num_devices(fs_info);
>> +	/*
>> +	 * rw_devices can be messed with by rm_device and device replace, so
>> +	 * take the chunk_mutex to make sure we have a relatively consistent
>> +	 * view of the fs at this point.
> 
> Well, what does 'relatively consistent' mean here? There are enough
> locks and exclusion that device remove or replace should not change the
> value until btrfs_balance ends, no?
> 

Again I don't have the code in front of me, but there's nothing at this point to 
stop us from running in at the tail end of device replace or device rm.  The 
mutex keeps us from getting weirdly inflated values when we increment and 
decrement at the end of device replace, but there's nothing (that I can 
remember) that will stop rw devices from changing right after we check it, thus 
relatively.  Thanks,

Josef
David Sterba Jan. 16, 2020, 3:59 p.m. UTC | #4
On Tue, Jan 14, 2020 at 01:07:22PM -0800, Josef Bacik wrote:
> >> -	num_devices = btrfs_num_devices(fs_info);
> >> +	/*
> >> +	 * rw_devices can be messed with by rm_device and device replace, so
> >> +	 * take the chunk_mutex to make sure we have a relatively consistent
> >> +	 * view of the fs at this point.
> > 
> > Well, what does 'relatively consistent' mean here? There are enough
> > locks and exclusion that device remove or replace should not change the
> > value until btrfs_balance ends, no?
> > 
> 
> Again I don't have the code in front of me, but there's nothing at this point to 
> stop us from running in at the tail end of device replace or device rm.

This should be prevented by the EXCL_OP mechanism, so even the end of
device remove or replace will not be running at this time because it
cannot even start.

> The 
> mutex keeps us from getting weirdly inflated values when we increment and 
> decrement at the end of device replace, but there's nothing (that I can 
> remember) that will stop rw devices from changing right after we check it, thus 
> relatively.

rw_devices is changed in a handful of places on a mounted filesystem,
not counting device open/close. Device remove and replace are excluded
from running at that time, rw_devices can't change at this point of
balance.

btrfs_dev_replace_finishing
 - when removing srcdev, rw_devices--
 - when adding the target device as new, rw_devices++

btrfs_rm_device
 - rw_devices--

btrfs_init_new_device (called by device add)
 - rw_devices++

So the chunk mutex is either redundant or there's something I'm missing.
Josef Bacik Jan. 16, 2020, 4:25 p.m. UTC | #5
On 1/16/20 10:59 AM, David Sterba wrote:
> On Tue, Jan 14, 2020 at 01:07:22PM -0800, Josef Bacik wrote:
>>>> -	num_devices = btrfs_num_devices(fs_info);
>>>> +	/*
>>>> +	 * rw_devices can be messed with by rm_device and device replace, so
>>>> +	 * take the chunk_mutex to make sure we have a relatively consistent
>>>> +	 * view of the fs at this point.
>>>
>>> Well, what does 'relatively consistent' mean here? There are enough
>>> locks and exclusion that device remove or replace should not change the
>>> value until btrfs_balance ends, no?
>>>
>>
>> Again I don't have the code in front of me, but there's nothing at this point to
>> stop us from running in at the tail end of device replace or device rm.
> 
> This should be prevented by the EXCL_OP mechanism, so even the end of
> device remove or replace will not be running at this time because it
> cannot even start.
> 
>> The
>> mutex keeps us from getting weirdly inflated values when we increment and
>> decrement at the end of device replace, but there's nothing (that I can
>> remember) that will stop rw devices from changing right after we check it, thus
>> relatively.
> 
> rw_devices is changed in a handful of places on a mounted filesystem,
> not counting device open/close. Device remove and replace are excluded
> from running at that time, rw_devices can't change at this point of
> balance.
> 
> btrfs_dev_replace_finishing
>   - when removing srcdev, rw_devices--
>   - when adding the target device as new, rw_devices++
> 
> btrfs_rm_device
>   - rw_devices--
> 
> btrfs_init_new_device (called by device add)
>   - rw_devices++
> 
> So the chunk mutex is either redundant or there's something I'm missing.
> 

Nope you're right, I missed the EXCL_OP thing, so we can just read rw_devices 
normally.  Thanks,

Josef
diff mbox series

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7483521a928b..a92059555754 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3881,7 +3881,14 @@  int btrfs_balance(struct btrfs_fs_info *fs_info,
 		}
 	}
 
-	num_devices = btrfs_num_devices(fs_info);
+	/*
+	 * rw_devices can be messed with by rm_device and device replace, so
+	 * take the chunk_mutex to make sure we have a relatively consistent
+	 * view of the fs at this point.
+	 */
+	mutex_lock(&fs_info->chunk_mutex);
+	num_devices = fs_info->fs_devices->rw_devices;
+	mutex_unlock(&fs_info->chunk_mutex);
 
 	/*
 	 * SINGLE profile on-disk has no profile bit, but in-memory we have a