Message ID: 20190718062749.11276-1-wqu@suse.com (mailing list archive)
State:      New, archived
Series:     btrfs: Allow more disks missing for RAID10
On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
> RAID10 can accept as many as half of its disks missing, as long as
> each sub stripe still has a good mirror.

Can you please make a test case for that?

I think the number of devices that can be lost can be higher than half
in some extreme cases: one device has copies of all stripes, and the
2nd copy can be scattered randomly over the other devices, but that's
highly unlikely to happen.

On average it's the same as raid1, but a more exact check could
potentially utilize the stripe layout.
On 2019-07-25 14:37, David Sterba wrote:
> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
>> RAID10 can accept as many as half of its disks missing, as long as
>> each sub stripe still has a good mirror.
>
> Can you please make a test case for that?
>
> I think the number of devices that can be lost can be higher than half
> in some extreme cases: one device has copies of all stripes, and the
> 2nd copy can be scattered randomly over the other devices, but that's
> highly unlikely to happen.

It is possible, but as you mention, highly unlikely. It's also possible
with raid1 mode, and a lot less unlikely there (in fact, it's almost
guaranteed to happen in certain configurations).

> On average it's the same as raid1, but a more exact check could
> potentially utilize the stripe layout.
On 2019/7/26 2:37 AM, David Sterba wrote:
> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
>> RAID10 can accept as many as half of its disks missing, as long as
>> each sub stripe still has a good mirror.
>
> Can you please make a test case for that?

An fstests one or a btrfs-progs one?

> I think the number of devices that can be lost can be higher than half
> in some extreme cases: one device has copies of all stripes, and the
> 2nd copy can be scattered randomly over the other devices, but that's
> highly unlikely to happen.
>
> On average it's the same as raid1, but a more exact check could
> potentially utilize the stripe layout.

That would have to happen at the extent level; to me it's a layering
violation, far from what we want to improve here, so not really worth it.

Thanks,
Qu
On Fri, Jul 26, 2019 at 07:41:41AM +0800, Qu Wenruo wrote:
> On 2019/7/26 2:37 AM, David Sterba wrote:
>> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
>>> RAID10 can accept as many as half of its disks missing, as long as
>>> each sub stripe still has a good mirror.
>>
>> Can you please make a test case for that?
>
> An fstests one or a btrfs-progs one?

For fstests.

>> I think the number of devices that can be lost can be higher than half
>> in some extreme cases: one device has copies of all stripes, and the
>> 2nd copy can be scattered randomly over the other devices, but that's
>> highly unlikely to happen.
>>
>> On average it's the same as raid1, but a more exact check could
>> potentially utilize the stripe layout.
>
> That would have to happen at the extent level; to me it's a layering
> violation, far from what we want to improve here, so not really worth it.

Ah, I don't mean to go to the extent level; what you implemented is
enough and already an improvement.
On 2019/7/26 6:39 PM, David Sterba wrote:
> On Fri, Jul 26, 2019 at 07:41:41AM +0800, Qu Wenruo wrote:
>> On 2019/7/26 2:37 AM, David Sterba wrote:
>>> On Thu, Jul 18, 2019 at 02:27:49PM +0800, Qu Wenruo wrote:
>>>> RAID10 can accept as many as half of its disks missing, as long as
>>>> each sub stripe still has a good mirror.
>>>
>>> Can you please make a test case for that?
>>
>> An fstests one or a btrfs-progs one?
>
> For fstests.

OK, that test case in fact exposed a long-existing bug: we can't create
degraded chunks.

So if we're replacing the missing devices of a 4-disk RAID10 btrfs, we
will hit ENOSPC because we can't find 4 devices to fulfill a new chunk,
and that finally triggers a transaction abort.

Please discard this patch until we solve that problem.

Thanks,
Qu

>>> I think the number of devices that can be lost can be higher than half
>>> in some extreme cases: one device has copies of all stripes, and the
>>> 2nd copy can be scattered randomly over the other devices, but that's
>>> highly unlikely to happen.
>>>
>>> On average it's the same as raid1, but a more exact check could
>>> potentially utilize the stripe layout.
>>
>> That would have to happen at the extent level; to me it's a layering
>> violation, far from what we want to improve here, so not really worth it.
>
> Ah, I don't mean to go to the extent level; what you implemented is
> enough and already an improvement.
On Wed, Jul 31, 2019 at 02:58:02PM +0800, Qu Wenruo wrote:
> On 2019/7/26 6:39 PM, David Sterba wrote:
>> For fstests.
>
> OK, that test case in fact exposed a long-existing bug: we can't create
> degraded chunks.
>
> So if we're replacing the missing devices of a 4-disk RAID10 btrfs, we
> will hit ENOSPC because we can't find 4 devices to fulfill a new chunk,
> and that finally triggers a transaction abort.
>
> Please discard this patch until we solve that problem.

Ok, done.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f209127a8bc6..65b10d13fc2d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7088,6 +7088,42 @@ int btrfs_read_sys_array(struct btrfs_fs_info *fs_info)
 	return -EIO;
 }
 
+static bool check_raid10_rw_degradable(struct btrfs_fs_info *fs_info,
+				       struct extent_map *em)
+{
+	struct map_lookup *map = em->map_lookup;
+	int sub_stripes = map->sub_stripes;
+	int num_stripes = map->num_stripes;
+	int tolerance = 1;
+	int i, j;
+
+	ASSERT(sub_stripes == 2);
+	ASSERT(num_stripes % sub_stripes == 0);
+	/*
+	 * Check sub stripes as a group; in each group we need to
+	 * have at least one good mirror.
+	 */
+	for (i = 0; i < num_stripes / sub_stripes; i++) {
+		int missing = 0;
+
+		for (j = 0; j < sub_stripes; j++) {
+			struct btrfs_device *dev = map->stripes[i * 2 + j].dev;
+
+			if (!dev || !dev->bdev ||
+			    test_bit(BTRFS_DEV_STATE_MISSING,
+				     &dev->dev_state) ||
+			    dev->last_flush_error)
+				missing++;
+		}
+		if (missing > tolerance) {
+			btrfs_warn(fs_info,
+"chunk %llu stripes %d,%d missing %d devices, max tolerance is %d for writable mount",
+				   em->start, i, i + sub_stripes - 1,
+				   missing, tolerance);
+			return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Check if all chunks in the fs are OK for read-write degraded mount
  *
@@ -7119,6 +7155,14 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 		int i;
 
 		map = em->map_lookup;
+		if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
+			ret = check_raid10_rw_degradable(fs_info, em);
+			if (!ret) {
+				free_extent_map(em);
+				goto out;
+			}
+			goto next;
+		}
 		max_tolerated =
 			btrfs_get_num_tolerated_disk_barrier_failures(
 					map->type);
@@ -7141,6 +7185,7 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 			ret = false;
 			goto out;
 		}
+next:
 		next_start = extent_map_end(em);
 		free_extent_map(em);
RAID10 can accept as many as half of its disks missing, as long as each
sub stripe still has a good mirror.

Thanks to the per-chunk degradable check, we can handle this pretty
easily now.

So add this special check for RAID10, to allow users to be creative (or
crazy) with btrfs RAID10.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)