
btrfs: Allow more disks missing for RAID10

Message ID: 20190718062749.11276-1-wqu@suse.com (mailing list archive)
State: New, archived
Series: btrfs: Allow more disks missing for RAID10

Commit Message

Qu Wenruo July 18, 2019, 6:27 a.m. UTC
RAID10 can tolerate as many as half of its disks missing, as long as
each sub-stripe still has a good mirror.

Thanks to the per-chunk degradable check, we can handle it pretty easily
now.

So add this special check for RAID10, to allow users to be creative
(or crazy) when using btrfs RAID10.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
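
To illustrate the rule described above, here is a minimal standalone C sketch (illustrative only, not part of the patch; the helper name and the device layouts are invented): a RAID10 chunk stays usable as long as no pair of sub-stripes loses both of its devices.

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified model: stripes come in pairs (sub_stripes == 2) and each
 * pair must keep at least one device that is still present.
 */
static bool raid10_chunk_degradable(const bool *missing, int num_stripes)
{
	int i;

	for (i = 0; i < num_stripes; i += 2) {
		if (missing[i] && missing[i + 1])
			return false;	/* both mirrors of this sub-stripe are gone */
	}
	return true;
}

int main(void)
{
	/* 4-disk RAID10 chunk: stripes 0/1 and 2/3 mirror each other. */
	bool one_per_pair[4] = { true, false, true, false };
	bool whole_pair[4]   = { true, true, false, false };

	printf("disks 0 and 2 missing: %s\n",
	       raid10_chunk_degradable(one_per_pair, 4) ? "mountable" : "lost");
	printf("disks 0 and 1 missing: %s\n",
	       raid10_chunk_degradable(whole_pair, 4) ? "mountable" : "lost");
	return 0;
}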

Comments

David Sterba July 25, 2019, 6:37 p.m. UTC | #1
On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
> RAID10 can tolerate as many as half of its disks missing, as long as
> each sub-stripe still has a good mirror.

Can you please make a test case for that?

I think the number of devices that can be lost can be higher than half
in some extreme cases: one device has copies of all stripes and the 2nd
copies can be scattered randomly across the other devices, but that's
highly unlikely to happen.

On average it's the same as raid1, but a more exact check could
potentially utilize the stripe layout.

Austin S. Hemmelgarn July 25, 2019, 7:14 p.m. UTC | #2
On 2019-07-25 14:37, David Sterba wrote:
> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
>> RAID10 can tolerate as many as half of its disks missing, as long as
>> each sub-stripe still has a good mirror.
> 
> Can you please make a test case for that?
> 
> I think the number of devices that can be lost can be higher than half
> in some extreme cases: one device has copies of all stripes and the 2nd
> copies can be scattered randomly across the other devices, but that's
> highly unlikely to happen.
It is possible, but as you mention, highly unlikely. It's also possible
with raid1 mode, and a lot less unlikely there (in fact, it's almost
guaranteed to happen in certain configurations).
> 
> On average it's the same as raid1, but a more exact check could
> potentially utilize the stripe layout.
>
Qu Wenruo July 25, 2019, 11:41 p.m. UTC | #3
On 2019/7/26 2:37 AM, David Sterba wrote:
> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
>> RAID10 can tolerate as many as half of its disks missing, as long as
>> each sub-stripe still has a good mirror.
> 
> Can you please make a test case for that?

Fstests one or btrfs-progs one?

> 
> I think the number of devices that can be lost can be higher than half
> in some extreme cases: one device has copies of all stripes and the 2nd
> copies can be scattered randomly across the other devices, but that's
> highly unlikely to happen.
> 
> On average it's the same as raid1, but a more exact check could
> potentially utilize the stripe layout.
> 
That would be at the extent level; to me it's a violation of the
internal layering, far from what we want to improve.

So it's not really worth it.

Thanks,
Qu
David Sterba July 26, 2019, 10:39 a.m. UTC | #4
On Fri, Jul 26, 2019 at 07:41:41AM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/7/26 2:37 AM, David Sterba wrote:
> > On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
> >> RAID10 can tolerate as many as half of its disks missing, as long as
> >> each sub-stripe still has a good mirror.
> > 
> > Can you please make a test case for that?
> 
> Fstests one or btrfs-progs one?

For fstests.

> > I think the number of devices that can be lost can be higher than half
> > in some extreme cases: one device has copies of all stripes and the 2nd
> > copies can be scattered randomly across the other devices, but that's
> > highly unlikely to happen.
> > 
> > On average it's the same as raid1, but a more exact check could
> > potentially utilize the stripe layout.
> > 
> That would be at the extent level; to me it's a violation of the
> internal layering, far from what we want to improve.

Ah, I don't mean to go to the extent level; what you implemented is
enough and an improvement.

Qu Wenruo July 31, 2019, 6:58 a.m. UTC | #5
On 2019/7/26 6:39 PM, David Sterba wrote:
> On Fri, Jul 26, 2019 at 07:41:41AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/7/26 2:37 AM, David Sterba wrote:
>>> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
>>>> RAID10 can tolerate as many as half of its disks missing, as long as
>>>> each sub-stripe still has a good mirror.
>>>
>>> Can you please make a test case for that?
>>
>> Fstests one or btrfs-progs one?
> 
> For fstests.

OK, that test case in fact exposed a long-existing bug: we can't create
degraded chunks.

So if we're replacing the missing devices on a 4-disk RAID10 btrfs, we
will hit ENOSPC because we can't find 4 devices to fill a new chunk,
and that will eventually trigger a transaction abort.

Please discard this patch until we solve that problem.

Thanks,
Qu

> 
>>> I think the number of devices that can be lost can be higher than half
>>> in some extreme cases: one device has copies of all stripes and the 2nd
>>> copies can be scattered randomly across the other devices, but that's
>>> highly unlikely to happen.
>>>
>>> On average it's the same as raid1, but a more exact check could
>>> potentially utilize the stripe layout.
>>>
>> That would be at the extent level; to me it's a violation of the
>> internal layering, far from what we want to improve.
> 
> Ah, I don't mean to go to the extent level; what you implemented is
> enough and an improvement.
>
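
As a toy model of the ENOSPC problem described above (invented names and constant, not btrfs code): without support for creating degraded chunks, a fresh RAID10 chunk wants four writable devices, and a 4-disk filesystem with two disks missing has only two.

#include <errno.h>
#include <stdio.h>

/* Invented constant for illustration: a full-width RAID10 chunk on a
 * 4-disk filesystem needs all 4 devices to be writable. */
#define RAID10_CHUNK_MIN_DEVS 4

/* Toy allocator: no degraded-chunk support, so too few devices fails. */
static int alloc_raid10_chunk(int rw_devices)
{
	if (rw_devices < RAID10_CHUNK_MIN_DEVS)
		return -ENOSPC;
	return 0;
}

int main(void)
{
	int total_devices = 4, missing_devices = 2;
	int ret = alloc_raid10_chunk(total_devices - missing_devices);

	if (ret)
		printf("chunk allocation failed (%d), transaction would be aborted\n",
		       ret);
	return 0;
}
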
David Sterba July 31, 2019, 1:23 p.m. UTC | #6
On Wed, Jul 31, 2019 at 02:58:02PM +0800, Qu Wenruo wrote:
> 
> 
> > On 2019/7/26 6:39 PM, David Sterba wrote:
> > On Fri, Jul 26, 2019 at 07:41:41AM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2019/7/26 2:37 AM, David Sterba wrote:
> >>> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
> >>>> RAID10 can tolerate as many as half of its disks missing, as long as
> >>>> each sub-stripe still has a good mirror.
> >>>
> >>> Can you please make a test case for that?
> >>
> >> Fstests one or btrfs-progs one?
> > 
> > For fstests.
> 
> OK, that test case in fact exposed a long-existing bug: we can't create
> degraded chunks.
> 
> So if we're replacing the missing devices on a 4-disk RAID10 btrfs, we
> will hit ENOSPC because we can't find 4 devices to fill a new chunk,
> and that will eventually trigger a transaction abort.
> 
> Please discard this patch until we solve that problem.

Ok, done.

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f209127a8bc6..65b10d13fc2d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7088,6 +7088,42 @@  int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
 	return -EIO;
 }
 
+static bool check_raid10_rw_degradable(struct btrfs_fs_info *fs_info,
+				       struct extent_map *em)
+{
+	struct map_lookup *map = em->map_lookup;
+	int sub_stripes = map->sub_stripes;
+	int num_stripes = map->num_stripes;
+	int tolerance = 1;
+	int i, j;
+
+	ASSERT(sub_stripes == 2);
+	ASSERT(num_stripes % sub_stripes == 0);
+	/*
+	 * Check sub-stripes as a group; each group needs at least one
+	 * good mirror.
+	 */
+	for (i = 0; i < num_stripes / sub_stripes; i++) {
+		int missing = 0;
+		for (j = 0; j < sub_stripes; j++) {
+			struct btrfs_device *dev = map->stripes[i * sub_stripes + j].dev;
+
+			if (!dev || !dev->bdev ||
+			    test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) ||
+			    dev->last_flush_error)
+				missing++;
+		}
+		if (missing > tolerance) {
+			btrfs_warn(fs_info,
+"chunk %llu stripes %d,%d missing %d devices, max tolerance is %d for writable mount",
+				   em->start, i * sub_stripes, i * sub_stripes + 1, missing,
+				   tolerance);
+			return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Check if all chunks in the fs are OK for read-write degraded mount
  *
@@ -7119,6 +7155,14 @@  bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 		int i;
 
 		map = em->map_lookup;
+		if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
+			ret = check_raid10_rw_degradable(fs_info, em);
+			if (!ret) {
+				free_extent_map(em);
+				goto out;
+			}
+			goto next;
+		}
 		max_tolerated =
 			btrfs_get_num_tolerated_disk_barrier_failures(
 					map->type);
@@ -7141,6 +7185,7 @@  bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 			ret = false;
 			goto out;
 		}
+next:
 		next_start = extent_map_end(em);
 		free_extent_map(em);
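
For reference, a compressed userspace-only sketch of the control flow the two hunks add (the struct and helper names are invented stand-ins, not the kernel API): RAID10 chunks get the per-pair check, while everything else keeps the existing comparison of missing devices against the maximum tolerated count.

#include <stdbool.h>
#include <stddef.h>

/* Invented, simplified stand-ins for the kernel structures. */
struct chunk {
	bool raid10;
	int num_stripes;
	int max_tolerated;	/* what the per-profile tolerance lookup would return */
	const bool *missing;	/* one flag per stripe */
};

/* RAID10 path: every pair of sub-stripes must keep one device. */
static bool raid10_pairs_ok(const struct chunk *c)
{
	int i;

	for (i = 0; i < c->num_stripes; i += 2)
		if (c->missing[i] && c->missing[i + 1])
			return false;
	return true;
}

/* Mirrors the shape of btrfs_check_rw_degradable() after the patch. */
static bool check_rw_degradable(const struct chunk *chunks, size_t nr)
{
	size_t i;
	int j;

	for (i = 0; i < nr; i++) {
		const struct chunk *c = &chunks[i];
		int missing = 0;

		if (c->raid10) {
			if (!raid10_pairs_ok(c))
				return false;
			continue;	/* the "goto next" in the patch */
		}
		for (j = 0; j < c->num_stripes; j++)
			missing += c->missing[j];
		if (missing > c->max_tolerated)
			return false;
	}
	return true;
}

int main(void)
{
	/* One 4-stripe RAID10 chunk with one disk missing per pair. */
	bool missing[4] = { true, false, true, false };
	struct chunk c = { true, 4, 1, missing };

	return check_rw_degradable(&c, 1) ? 0 : 1;
}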