[v6,11/28] btrfs: make unmirrored BGs readonly only if we have at least one writable BG

Message ID 20191213040915.3502922-12-naohiro.aota@wdc.com
State New
Series
  • btrfs: zoned block device support

Commit Message

Naohiro Aota Dec. 13, 2019, 4:08 a.m. UTC
If the btrfs volume has mirrored block groups, un-mirrored block groups are
unconditionally made read only. When we have mirrored block groups but none
of them is writable, this leaves the volume with no writable block groups at
all. So, check that we have at least one writable mirrored block group before
setting un-mirrored block groups read only.

This change is necessary to handle e.g. xfstests btrfs/124 case.

When we mount a degraded RAID1 FS and write to it, and then re-mount with
all devices present, the write pointers of the corresponding zones of the
written block groups differ. We mark such a block group as "wp_broken" and
make it read only. In this situation, the only RAID1 block groups we have
are read only because of "wp_broken", and un-mirrored block groups are also
marked read only because RAID1 block groups exist. As a result, all the
block groups are now read only, so we cannot even start a rebalance to fix
the situation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)
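
The decision described in the commit message can be modeled in userspace. The following is only a minimal sketch, not the kernel code: `struct bg`, `enum profile`, `is_mirrored()` and `mark_unmirrored_ro()` are hypothetical names invented for illustration, and the per-space_info lists of the real code are flattened into one array. It contrasts the old behaviour (any mirrored block group triggers the read-only marking) with the patched one (only a writable mirrored block group does).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
enum profile { PROFILE_SINGLE, PROFILE_RAID0, PROFILE_RAID1 };

struct bg {
	enum profile profile;
	bool ro;
};

/* Same split as the patch: everything except SINGLE and RAID0 is mirrored. */
static bool is_mirrored(enum profile p)
{
	return p != PROFILE_SINGLE && p != PROFILE_RAID0;
}

/* Userspace counterpart of the check added by the patch. */
static bool have_writable_mirrored(const struct bg *bgs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (is_mirrored(bgs[i].profile) && !bgs[i].ro)
			return true;
	return false;
}

/*
 * Mark un-mirrored block groups read only.
 * check_writable == false models the old rule (any mirrored BG triggers it),
 * check_writable == true models the patched rule.
 * Returns the number of block groups that are still writable afterwards.
 */
static size_t mark_unmirrored_ro(struct bg *bgs, size_t n, bool check_writable)
{
	bool any_mirrored = false;
	bool trigger;
	size_t writable = 0;

	assert(bgs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (is_mirrored(bgs[i].profile))
			any_mirrored = true;

	trigger = any_mirrored &&
		  (!check_writable || have_writable_mirrored(bgs, n));

	for (size_t i = 0; i < n; i++) {
		if (trigger && !is_mirrored(bgs[i].profile))
			bgs[i].ro = true;
		if (!bgs[i].ro)
			writable++;
	}
	return writable;
}
```

With one read-only RAID1 block group and one writable SINGLE block group, the old rule leaves nothing writable, while the patched rule keeps the SINGLE block group writable.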

Comments

Josef Bacik Dec. 17, 2019, 7:25 p.m. UTC | #1
I'm not sure I understand.  In degraded mode we're writing to just one mirror of 
a RAID1 block group, correct?  And this messes up the WP for the broken side, so 
it gets marked with wp_broken and thus RO.  How does this patch help?  The block 
groups are still marked RAID1 right?  Or are new block groups allocated with 
SINGLE or RAID0?  I'm confused.  Thanks,

Josef
Naohiro Aota Dec. 18, 2019, 7:35 a.m. UTC | #2
First of all, I found that some recent change (maybe commit
112974d4067b ("btrfs: volumes: Remove ENOSPC-prone
btrfs_can_relocate()")?) solved the issue, so we no longer need patch
11 and 12. So, I will drop these two in the next version.

So, I think you may no longer be interested in the answer, but just
for the record... The situation was like this:

* before degrading
   - All block groups are RAID1, working fine.

* degraded mount
   - Block groups allocated before degrading are RAID1. Writes go
     into the RAID1 block groups and break the write pointer.
   - Newly allocated block groups are SINGLE, since we only have one
     available device.

* mount with both drives again
   - RAID1 block groups are marked RO because of the broken write pointer
   - SINGLE block groups are also marked RO because we have RAID1 block
     groups

and at this point, btrfs was somehow unable to allocate a new block
group or to start balancing.
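
The way a degraded write breaks the write pointer can be sketched in userspace as follows. This is an illustrative model under stated assumptions, not the zoned btrfs implementation: `struct mirrored_bg`, `bg_write()` and `bg_check_wp()` are hypothetical names, each mirror is reduced to a single zone write pointer, and only the "wp_broken" flag name is taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_MIRRORS 2

/* Hypothetical model of a RAID1 block group on zoned devices. */
struct mirrored_bg {
	uint64_t wp[NR_MIRRORS];	/* zone write pointer on each device */
	bool wp_broken;
	bool ro;
};

/*
 * Sequential write of len bytes. Only devices flagged as present receive
 * the data, so a degraded mount advances just one write pointer.
 */
static void bg_write(struct mirrored_bg *bg, uint64_t len,
		     const bool present[NR_MIRRORS])
{
	assert(bg != NULL);
	for (int i = 0; i < NR_MIRRORS; i++)
		if (present[i])
			bg->wp[i] += len;
}

/*
 * Mount-time check: if the mirrors disagree on the write pointer, flag the
 * block group as wp_broken and make it read only.
 */
static void bg_check_wp(struct mirrored_bg *bg)
{
	assert(bg != NULL);
	for (int i = 1; i < NR_MIRRORS; i++) {
		if (bg->wp[i] != bg->wp[0]) {
			bg->wp_broken = true;
			bg->ro = true;
			return;
		}
	}
}
```

Writing with both devices present keeps the pointers in sync. A write with one device missing, followed by the check on the next full mount, leaves the block group read only, which is the state the "mount with both drives again" step describes.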
Josef Bacik Dec. 18, 2019, 2:54 p.m. UTC | #3
Oooh ok I see, I had it in my head we would still allocate RAID1 chunks, but we 
allocate SINGLE, so that makes sense.  Go ahead and drop those patches, and 
thanks for the explanation.

Josef

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5c04422f6f5a..b286359f3876 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1813,6 +1813,27 @@  static int read_one_block_group(struct btrfs_fs_info *info,
 	return ret;
 }
 
+/*
+ * have_mirrored_block_group - check if we have at least one writable
+ *                             mirrored Block Group
+ */
+static bool have_mirrored_block_group(struct btrfs_space_info *space_info)
+{
+	struct btrfs_block_group *block_group;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		if (i == BTRFS_RAID_RAID0 || i == BTRFS_RAID_SINGLE)
+			continue;
+		list_for_each_entry(block_group, &space_info->block_groups[i],
+				    list) {
+			if (!block_group->ro)
+				return true;
+		}
+	}
+	return false;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -1861,6 +1882,10 @@  int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		       BTRFS_BLOCK_GROUP_RAID56_MASK |
 		       BTRFS_BLOCK_GROUP_DUP)))
 			continue;
+
+		if (!have_mirrored_block_group(space_info))
+			continue;
+
 		/*
 		 * Avoid allocating from un-mirrored block group if there are
 		 * mirrored block groups.