btrfs: zoned: handle broken write pointer on zones

Message ID	6a8b1550cef136b1d733d5c1016a7ba717335344.1725035560.git.naohiro.aota@wdc.com (mailing list archive)
State	New, archived
Headers	show Received: from esa5.hgst.iphmx.com (esa5.hgst.iphmx.com [216.71.153.144]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D2C3326ADD for <linux-btrfs@vger.kernel.org>; Fri, 30 Aug 2024 16:33:04 +0000 (UTC) IronPort-SDR: 66d1e778_Wy311KnLYOZ+AMWf7Zk7N/yweH9B6fW/Hqhbx227p/sgU3U gpLBc1ZJfFWd4+fxD5J44r4uYf6oSrcnGxSxHSA== WDCIronportException: Internal From: Naohiro Aota <naohiro.aota@wdc.com> To: linux-btrfs@vger.kernel.org Cc: xuefer@gmail.com, Naohiro Aota <naohiro.aota@wdc.com>, HAN Yuwei <hrx@bupt.moe> Subject: [PATCH] btrfs: zoned: handle broken write pointer on zones Date: Sat, 31 Aug 2024 01:32:49 +0900 Message-ID: <6a8b1550cef136b1d733d5c1016a7ba717335344.1725035560.git.naohiro.aota@wdc.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	btrfs: zoned: handle broken write pointer on zones \| expand btrfs: zoned: handle broken write pointer on zones

Naohiro Aota Aug. 30, 2024, 4:32 p.m. UTC

Btrfs rejects to mount a FS if it finds a block group with a broken write
pointer (e.g, unequal write pointers on two zones of RAID1 block group).
Since such case can happen easily with a power-loss or crash of a system,
we need to handle the case more gently.

Handle such block group by making it unallocatable, so that there will be
no writes into it. That can be done by setting the allocation pointer at
the end of allocating region (= block_group->zone_capacity). Then, existing
code handle zone_unusable properly.

Having proper zone_capacity is necessary for the change. So, set it as fast
as possible.

We cannot handle RAID0 and RAID10 case like this. But, they are anyway
unable to read because of a missing stripe.

Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.1+
Reported-by: HAN Yuwei <hrx@bupt.moe>
Cc: Xuefer <xuefer@gmail.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

Yuwei Han Aug. 31, 2024, 3:55 a.m. UTC | #1

在 2024/8/31 00:32, Naohiro Aota 写道:
> Btrfs rejects to mount a FS if it finds a block group with a broken write
> pointer (e.g, unequal write pointers on two zones of RAID1 block group).
> Since such case can happen easily with a power-loss or crash of a system,
> we need to handle the case more gently.
> 
Can this (broken write pointer) be reproduced with some gentle way? So I 
can test it without worrying burning out my precious HC620 ( they are 
very difficult to purchase).
> Handle such block group by making it unallocatable, so that there will be
> no writes into it. That can be done by setting the allocation pointer at
> the end of allocating region (= block_group->zone_capacity). Then, existing
> code handle zone_unusable properly.
> 
> Having proper zone_capacity is necessary for the change. So, set it as fast
> as possible.
> 
> We cannot handle RAID0 and RAID10 case like this. But, they are anyway
> unable to read because of a missing stripe.
> 
> Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
> Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
> CC: stable@vger.kernel.org # 6.1+
> Reported-by: HAN Yuwei <hrx@bupt.moe>
> Cc: Xuefer <xuefer@gmail.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/zoned.c | 30 +++++++++++++++++++++++++-----
>   1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 66f63e82af79..047e3337852e 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1406,6 +1406,8 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
>   		return -EINVAL;
>   	}
>   
> +	bg->zone_capacity = min_not_zero(zone_info[0].capacity, zone_info[1].capacity);
> +
>   	if (zone_info[0].alloc_offset == WP_MISSING_DEV) {
>   		btrfs_err(bg->fs_info,
>   			  "zoned: cannot recover write pointer for zone %llu",
> @@ -1432,7 +1434,6 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
>   	}
>   
>   	bg->alloc_offset = zone_info[0].alloc_offset;
> -	bg->zone_capacity = min(zone_info[0].capacity, zone_info[1].capacity);
>   	return 0;
>   }
>   
> @@ -1450,6 +1451,9 @@ static int btrfs_load_block_group_raid1(struct btrfs_block_group *bg,
>   		return -EINVAL;
>   	}
>   
> +	/* In case a device is missing we have a cap of 0, so don't use it. */
> +	bg->zone_capacity = min_not_zero(zone_info[0].capacity, zone_info[1].capacity);
> +
>   	for (i = 0; i < map->num_stripes; i++) {
>   		if (zone_info[i].alloc_offset == WP_MISSING_DEV ||
>   		    zone_info[i].alloc_offset == WP_CONVENTIONAL)
> @@ -1471,9 +1475,6 @@ static int btrfs_load_block_group_raid1(struct btrfs_block_group *bg,
>   			if (test_bit(0, active))
>   				set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &bg->runtime_flags);
>   		}
> -		/* In case a device is missing we have a cap of 0, so don't use it. */
> -		bg->zone_capacity = min_not_zero(zone_info[0].capacity,
> -						 zone_info[1].capacity);
>   	}
>   
>   	if (zone_info[0].alloc_offset != WP_MISSING_DEV)
> @@ -1563,6 +1564,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>   	unsigned long *active = NULL;
>   	u64 last_alloc = 0;
>   	u32 num_sequential = 0, num_conventional = 0;
> +	u64 profile;
>   
>   	if (!btrfs_is_zoned(fs_info))
>   		return 0;
> @@ -1623,7 +1625,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>   		}
>   	}
>   
> -	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> +	profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> +	switch (profile) {
>   	case 0: /* single */
>   		ret = btrfs_load_block_group_single(cache, &zone_info[0], active);
>   		break;
> @@ -1650,6 +1653,23 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>   		goto out;
>   	}
>   
> +	if (ret == -EIO && profile != 0 && profile != BTRFS_BLOCK_GROUP_RAID0 &&
> +	    profile != BTRFS_BLOCK_GROUP_RAID10) {
> +		/*
> +		 * Detected broken write pointer.  Make this block group
> +		 * unallocatable by setting the allocation pointer at the end of
> +		 * allocatable region. Relocating this block group will fix the
> +		 * mismatch.
> +		 *
> +		 * Currently, we cannot handle RAID0 or RAID10 case like this
> +		 * because we don't have a proper zone_capacity value. But,
> +		 * reading from this block group won't work anyway by a missing
> +		 * stripe.
> +		 */
> +		cache->alloc_offset = cache->zone_capacity;
> +		ret = 0;
> +	}
> +
>   out:
>   	/* Reject non SINGLE data profiles without RST */
>   	if ((map->type & BTRFS_BLOCK_GROUP_DATA) &&

Naohiro Aota Sept. 2, 2024, 8:43 a.m. UTC | #2

On Sat, Aug 31, 2024 at 11:55:45AM GMT, Yuwei Han wrote:
> 
> 
> 在 2024/8/31 00:32, Naohiro Aota 写道:
> > Btrfs rejects to mount a FS if it finds a block group with a broken write
> > pointer (e.g, unequal write pointers on two zones of RAID1 block group).
> > Since such case can happen easily with a power-loss or crash of a system,
> > we need to handle the case more gently.
> > 
> Can this (broken write pointer) be reproduced with some gentle way? So I can
> test it without worrying burning out my precious HC620 ( they are very
> difficult to purchase).

What do you mean "gentle way"? How gentle would you want? Basically, you
need to write directly on the device with a FS to reproduce the issue,
which is not so gentle in a usual sense.

If you really want, you can do the following to reproduce it, but it will
risk the data on FS. So, I don't recommend running this on a FS with real
data, without knowing what is happening with the commands. At least, you
need to have a backup before running following commands.

1. Use "btrfs inspect-internal dump-tree -t CHUNK" to see a block group's physical address (DATA block group would be fine)
e.g,

$ btrfs inspect-internal dump-tree -t CHUNK /dev/nvme1n1|grep -B1 -A7 ' DATA'
        item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 939524096) itemoff 15671 itemsize 112
                length 134217728 owner 2 stripe_len 65536 type DATA|RAID1
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 2 sub_stripes 1
                        stripe 0 devid 1 offset 939524096
                        dev_uuid e5f6062c-d6cc-4557-a1a5-087c56262568
                        stripe 1 devid 2 offset 536870912
                        dev_uuid 50c49da8-0925-4580-a73a-9c738806e8e0

So, we have a zone on LBA 939524096 on devid 1 (= /dev/nvme1n1 on my setup).

2. Use "blkzone" to see the write pointer position
e.g,

$ blkzone report -o $((939524096 / 512)) -c 1 /dev/nvme1n1
  start: 0x0001c0000, len 0x040000, cap 0x040000, wptr 0x005000 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]

The "wptr" column shows the write pointer position in sectors unit. It is
0x5000 sectors = 0x5000 * 512 bytes = 10485760 bytes = 10MB

3. *DANGEROUS PART* Use "dd" to move the write pointer, really taking care of the option

e.g,
dd if=/dev/zero of=/dev/nvme1n1 bs=4096 seek=$(( (939524096 + 10485760) / 4096 )) count=1

4. Check if the write pointer is moved

e.g,
$ blkzone report -o $((939524096 / 512)) -c 1 /dev/nvme1n1
  start: 0x0001c0000, len 0x040000, cap 0x040000, wptr 0x005008 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]

wptr changed from 0x5000 to 0x5008

5. mount the FS

e.g,
# mount /dev/nvme1n1 /mnt/scratch
[  895.570990][  T815] BTRFS info (device nvme1n1): first mount of filesystem 4d863012-ad89-478c-9802-113026a1446b
[  895.571684][  T815] BTRFS info (device nvme1n1): using crc32c (crc32c-intel) checksum algorithm
[  895.572158][  T815] BTRFS info (device nvme1n1): using free-space-tree
[  895.579045][  T815] BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme1n1, 160 zones of 134217728 bytes
[  895.579645][  T815] BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme2n1, 160 zones of 134217728 bytes
[  895.579956][  T815] BTRFS info (device nvme1n1): zoned: async discard ignored and disabled for zoned mode
[  895.580156][  T815] BTRFS info (device nvme1n1): zoned mode enabled with zone size 134217728
[  895.581260][  T815] BTRFS error (device nvme1n1): zoned: write pointer offset mismatch of zones in raid1 profile

Write pointer mismatch is still reported, but

# ls /mnt/scratch
foo
# dd if=/dev/zero of=/mnt/scratch/bar bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0317761 s, 330 MB/s
# sync

You should still be able to read/write the FS.

> > Handle such block group by making it unallocatable, so that there will be
> > no writes into it. That can be done by setting the allocation pointer at
> > the end of allocating region (= block_group->zone_capacity). Then, existing
> > code handle zone_unusable properly.
> > 
> > Having proper zone_capacity is necessary for the change. So, set it as fast
> > as possible.
> > 
> > We cannot handle RAID0 and RAID10 case like this. But, they are anyway
> > unable to read because of a missing stripe.
> > 
> > Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
> > Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
> > CC: stable@vger.kernel.org # 6.1+
> > Reported-by: HAN Yuwei <hrx@bupt.moe>
> > Cc: Xuefer <xuefer@gmail.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >   fs/btrfs/zoned.c | 30 +++++++++++++++++++++++++-----
> >   1 file changed, 25 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> > index 66f63e82af79..047e3337852e 100644
> > --- a/fs/btrfs/zoned.c
> > +++ b/fs/btrfs/zoned.c
> > @@ -1406,6 +1406,8 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
> >   		return -EINVAL;
> >   	}
> > +	bg->zone_capacity = min_not_zero(zone_info[0].capacity, zone_info[1].capacity);
> > +
> >   	if (zone_info[0].alloc_offset == WP_MISSING_DEV) {
> >   		btrfs_err(bg->fs_info,
> >   			  "zoned: cannot recover write pointer for zone %llu",
> > @@ -1432,7 +1434,6 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
> >   	}
> >   	bg->alloc_offset = zone_info[0].alloc_offset;
> > -	bg->zone_capacity = min(zone_info[0].capacity, zone_info[1].capacity);
> >   	return 0;
> >   }
> > @@ -1450,6 +1451,9 @@ static int btrfs_load_block_group_raid1(struct btrfs_block_group *bg,
> >   		return -EINVAL;
> >   	}
> > +	/* In case a device is missing we have a cap of 0, so don't use it. */
> > +	bg->zone_capacity = min_not_zero(zone_info[0].capacity, zone_info[1].capacity);
> > +
> >   	for (i = 0; i < map->num_stripes; i++) {
> >   		if (zone_info[i].alloc_offset == WP_MISSING_DEV ||
> >   		    zone_info[i].alloc_offset == WP_CONVENTIONAL)
> > @@ -1471,9 +1475,6 @@ static int btrfs_load_block_group_raid1(struct btrfs_block_group *bg,
> >   			if (test_bit(0, active))
> >   				set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &bg->runtime_flags);
> >   		}
> > -		/* In case a device is missing we have a cap of 0, so don't use it. */
> > -		bg->zone_capacity = min_not_zero(zone_info[0].capacity,
> > -						 zone_info[1].capacity);
> >   	}
> >   	if (zone_info[0].alloc_offset != WP_MISSING_DEV)
> > @@ -1563,6 +1564,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> >   	unsigned long *active = NULL;
> >   	u64 last_alloc = 0;
> >   	u32 num_sequential = 0, num_conventional = 0;
> > +	u64 profile;
> >   	if (!btrfs_is_zoned(fs_info))
> >   		return 0;
> > @@ -1623,7 +1625,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> >   		}
> >   	}
> > -	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> > +	profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
> > +	switch (profile) {
> >   	case 0: /* single */
> >   		ret = btrfs_load_block_group_single(cache, &zone_info[0], active);
> >   		break;
> > @@ -1650,6 +1653,23 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> >   		goto out;
> >   	}
> > +	if (ret == -EIO && profile != 0 && profile != BTRFS_BLOCK_GROUP_RAID0 &&
> > +	    profile != BTRFS_BLOCK_GROUP_RAID10) {
> > +		/*
> > +		 * Detected broken write pointer.  Make this block group
> > +		 * unallocatable by setting the allocation pointer at the end of
> > +		 * allocatable region. Relocating this block group will fix the
> > +		 * mismatch.
> > +		 *
> > +		 * Currently, we cannot handle RAID0 or RAID10 case like this
> > +		 * because we don't have a proper zone_capacity value. But,
> > +		 * reading from this block group won't work anyway by a missing
> > +		 * stripe.
> > +		 */
> > +		cache->alloc_offset = cache->zone_capacity;
> > +		ret = 0;
> > +	}
> > +
> >   out:
> >   	/* Reject non SINGLE data profiles without RST */
> >   	if ((map->type & BTRFS_BLOCK_GROUP_DATA) &&

Yuwei Han Sept. 2, 2024, 3:04 p.m. UTC | #3

在 2024/9/2 16:43, Naohiro Aota 写道:
> On Sat, Aug 31, 2024 at 11:55:45AM GMT, Yuwei Han wrote:
>>
>>
>> 在 2024/8/31 00:32, Naohiro Aota 写道:
>>> Btrfs rejects to mount a FS if it finds a block group with a broken write
>>> pointer (e.g, unequal write pointers on two zones of RAID1 block group).
>>> Since such case can happen easily with a power-loss or crash of a system,
>>> we need to handle the case more gently.
>>>
>> Can this (broken write pointer) be reproduced with some gentle way? So I can
>> test it without worrying burning out my precious HC620 ( they are very
>> difficult to purchase).
> 
> What do you mean "gentle way"? How gentle would you want? Basically, you
> need to write directly on the device with a FS to reproduce the issue,
> which is not so gentle in a usual sense.
I mean to reproduce it without hard reset my machine and it can't 
reliable reproduce it.
Because I have reset my disks, I can't test this patch anymore. With 
this method I can test this patch on real device.
> 
> If you really want, you can do the following to reproduce it, but it will
> risk the data on FS. So, I don't recommend running this on a FS with real
> data, without knowing what is happening with the commands. At least, you
> need to have a backup before running following commands.
> 
No problem. I know this is FOR TESTING ONLY.
> 1. Use "btrfs inspect-internal dump-tree -t CHUNK" to see a block group's physical address (DATA block group would be fine)
> e.g,
> 
> $ btrfs inspect-internal dump-tree -t CHUNK /dev/nvme1n1|grep -B1 -A7 ' DATA'
>          item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 939524096) itemoff 15671 itemsize 112
>                  length 134217728 owner 2 stripe_len 65536 type DATA|RAID1
>                  io_align 65536 io_width 65536 sector_size 4096
>                  num_stripes 2 sub_stripes 1
>                          stripe 0 devid 1 offset 939524096
>                          dev_uuid e5f6062c-d6cc-4557-a1a5-087c56262568
>                          stripe 1 devid 2 offset 536870912
>                          dev_uuid 50c49da8-0925-4580-a73a-9c738806e8e0
> 
> So, we have a zone on LBA 939524096 on devid 1 (= /dev/nvme1n1 on my setup).
> 
> 2. Use "blkzone" to see the write pointer position
> e.g,
> 
> $ blkzone report -o $((939524096 / 512)) -c 1 /dev/nvme1n1
>    start: 0x0001c0000, len 0x040000, cap 0x040000, wptr 0x005000 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
> 
> The "wptr" column shows the write pointer position in sectors unit. It is
> 0x5000 sectors = 0x5000 * 512 bytes = 10485760 bytes = 10MB
> 
> 3. *DANGEROUS PART* Use "dd" to move the write pointer, really taking care of the option
> 
> e.g,
> dd if=/dev/zero of=/dev/nvme1n1 bs=4096 seek=$(( (939524096 + 10485760) / 4096 )) count=1
> 
> 4. Check if the write pointer is moved
> 
> e.g,
> $ blkzone report -o $((939524096 / 512)) -c 1 /dev/nvme1n1
>    start: 0x0001c0000, len 0x040000, cap 0x040000, wptr 0x005008 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
> 
> wptr changed from 0x5000 to 0x5008
> 
> 5. mount the FS
> 
> e.g,
> # mount /dev/nvme1n1 /mnt/scratch
> [  895.570990][  T815] BTRFS info (device nvme1n1): first mount of filesystem 4d863012-ad89-478c-9802-113026a1446b
> [  895.571684][  T815] BTRFS info (device nvme1n1): using crc32c (crc32c-intel) checksum algorithm
> [  895.572158][  T815] BTRFS info (device nvme1n1): using free-space-tree
> [  895.579045][  T815] BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme1n1, 160 zones of 134217728 bytes
> [  895.579645][  T815] BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme2n1, 160 zones of 134217728 bytes
> [  895.579956][  T815] BTRFS info (device nvme1n1): zoned: async discard ignored and disabled for zoned mode
> [  895.580156][  T815] BTRFS info (device nvme1n1): zoned mode enabled with zone size 134217728
> [  895.581260][  T815] BTRFS error (device nvme1n1): zoned: write pointer offset mismatch of zones in raid1 profile
> 
> Write pointer mismatch is still reported, but
> 
> # ls /mnt/scratch
> foo
> # dd if=/dev/zero of=/mnt/scratch/bar bs=1M count=10
> 10+0 records in
> 10+0 records out
> 10485760 bytes (10 MB, 10 MiB) copied, 0.0317761 s, 330 MB/s
> # sync
> 
> You should still be able to read/write the FS.
> 
Much thanks. I will test it on next merge window. :)
>>> Handle such block group by making it unallocatable, so that there will be
>>> no writes into it. That can be done by setting the allocation pointer at
>>> the end of allocating region (= block_group->zone_capacity). Then, existing
>>> code handle zone_unusable properly.
>>>
>>> Having proper zone_capacity is necessary for the change. So, set it as fast
>>> as possible.
>>>
>>> We cannot handle RAID0 and RAID10 case like this. But, they are anyway
>>> unable to read because of a missing stripe.
>>>
>>> Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
>>> Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
>>> CC: stable@vger.kernel.org # 6.1+
>>> Reported-by: HAN Yuwei <hrx@bupt.moe>
>>> Cc: Xuefer <xuefer@gmail.com>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> ---
>>>    fs/btrfs/zoned.c | 30 +++++++++++++++++++++++++-----
>>>    1 file changed, 25 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>>> index 66f63e82af79..047e3337852e 100644
>>> --- a/fs/btrfs/zoned.c
>>> +++ b/fs/btrfs/zoned.c
>>> @@ -1406,6 +1406,8 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
>>>    		return -EINVAL;
>>>    	}
>>> +	bg->zone_capacity = min_not_zero(zone_info[0].capacity, zone_info[1].capacity);
>>> +
>>>    	if (zone_info[0].alloc_offset == WP_MISSING_DEV) {
>>>    		btrfs_err(bg->fs_info,
>>>    			  "zoned: cannot recover write pointer for zone %llu",
>>> @@ -1432,7 +1434,6 @@ static int btrfs_load_block_group_dup(struct btrfs_block_group *bg,
>>>    	}
>>>    	bg->alloc_offset = zone_info[0].alloc_offset;
>>> -	bg->zone_capacity = min(zone_info[0].capacity, zone_info[1].capacity);
>>>    	return 0;
>>>    }
>>> @@ -1450,6 +1451,9 @@ static int btrfs_load_block_group_raid1(struct btrfs_block_group *bg,
>>>    		return -EINVAL;
>>>    	}
>>> +	/* In case a device is missing we have a cap of 0, so don't use it. */
>>> +	bg->zone_capacity = min_not_zero(zone_info[0].capacity, zone_info[1].capacity);
>>> +
>>>    	for (i = 0; i < map->num_stripes; i++) {
>>>    		if (zone_info[i].alloc_offset == WP_MISSING_DEV ||
>>>    		    zone_info[i].alloc_offset == WP_CONVENTIONAL)
>>> @@ -1471,9 +1475,6 @@ static int btrfs_load_block_group_raid1(struct btrfs_block_group *bg,
>>>    			if (test_bit(0, active))
>>>    				set_bit(BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE, &bg->runtime_flags);
>>>    		}
>>> -		/* In case a device is missing we have a cap of 0, so don't use it. */
>>> -		bg->zone_capacity = min_not_zero(zone_info[0].capacity,
>>> -						 zone_info[1].capacity);
>>>    	}
>>>    	if (zone_info[0].alloc_offset != WP_MISSING_DEV)
>>> @@ -1563,6 +1564,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>>>    	unsigned long *active = NULL;
>>>    	u64 last_alloc = 0;
>>>    	u32 num_sequential = 0, num_conventional = 0;
>>> +	u64 profile;
>>>    	if (!btrfs_is_zoned(fs_info))
>>>    		return 0;
>>> @@ -1623,7 +1625,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>>>    		}
>>>    	}
>>> -	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>>> +	profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
>>> +	switch (profile) {
>>>    	case 0: /* single */
>>>    		ret = btrfs_load_block_group_single(cache, &zone_info[0], active);
>>>    		break;
>>> @@ -1650,6 +1653,23 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>>>    		goto out;
>>>    	}
>>> +	if (ret == -EIO && profile != 0 && profile != BTRFS_BLOCK_GROUP_RAID0 &&
>>> +	    profile != BTRFS_BLOCK_GROUP_RAID10) {
>>> +		/*
>>> +		 * Detected broken write pointer.  Make this block group
>>> +		 * unallocatable by setting the allocation pointer at the end of
>>> +		 * allocatable region. Relocating this block group will fix the
>>> +		 * mismatch.
>>> +		 *
>>> +		 * Currently, we cannot handle RAID0 or RAID10 case like this
>>> +		 * because we don't have a proper zone_capacity value. But,
>>> +		 * reading from this block group won't work anyway by a missing
>>> +		 * stripe.
>>> +		 */
>>> +		cache->alloc_offset = cache->zone_capacity;
>>> +		ret = 0;
>>> +	}
>>> +
>>>    out:
>>>    	/* Reject non SINGLE data profiles without RST */
>>>    	if ((map->type & BTRFS_BLOCK_GROUP_DATA) &&
> 
>

David Sterba Sept. 2, 2024, 8:31 p.m. UTC | #4

On Sat, Aug 31, 2024 at 01:32:49AM +0900, Naohiro Aota wrote:
> Btrfs rejects to mount a FS if it finds a block group with a broken write
> pointer (e.g, unequal write pointers on two zones of RAID1 block group).
> Since such case can happen easily with a power-loss or crash of a system,
> we need to handle the case more gently.
> 
> Handle such block group by making it unallocatable, so that there will be
> no writes into it. That can be done by setting the allocation pointer at
> the end of allocating region (= block_group->zone_capacity). Then, existing
> code handle zone_unusable properly.

This sounds like the best option, this makes the zone read-only and
relocation will reset it back to a good state. Alternatives like another
state or error bits would need tracking them and increase complexity.

> Having proper zone_capacity is necessary for the change. So, set it as fast
> as possible.
> 
> We cannot handle RAID0 and RAID10 case like this. But, they are anyway
> unable to read because of a missing stripe.
> 
> Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
> Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
> CC: stable@vger.kernel.org # 6.1+
> Reported-by: HAN Yuwei <hrx@bupt.moe>
> Cc: Xuefer <xuefer@gmail.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: David Sterba <dsterba@suse.com>

> @@ -1650,6 +1653,23 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
>  		goto out;
>  	}
>  
> +	if (ret == -EIO && profile != 0 && profile != BTRFS_BLOCK_GROUP_RAID0 &&
> +	    profile != BTRFS_BLOCK_GROUP_RAID10) {
> +		/*
> +		 * Detected broken write pointer.  Make this block group
> +		 * unallocatable by setting the allocation pointer at the end of
> +		 * allocatable region. Relocating this block group will fix the
> +		 * mismatch.
> +		 *
> +		 * Currently, we cannot handle RAID0 or RAID10 case like this
> +		 * because we don't have a proper zone_capacity value. But,
> +		 * reading from this block group won't work anyway by a missing
> +		 * stripe.
> +		 */
> +		cache->alloc_offset = cache->zone_capacity;

Mabe print a warning, this looks like a condition that should be
communicated back to the user.

Xuefer Sept. 3, 2024, 5:24 a.m. UTC | #5

On Tue, Sep 3, 2024 at 4:31 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Sat, Aug 31, 2024 at 01:32:49AM +0900, Naohiro Aota wrote:
> > Btrfs rejects to mount a FS if it finds a block group with a broken write
> > pointer (e.g, unequal write pointers on two zones of RAID1 block group).
> > Since such case can happen easily with a power-loss or crash of a system,
> > we need to handle the case more gently.
> >
> > Handle such block group by making it unallocatable, so that there will be
> > no writes into it. That can be done by setting the allocation pointer at
> > the end of allocating region (= block_group->zone_capacity). Then, existing
> > code handle zone_unusable properly.
>
> This sounds like the best option, this makes the zone read-only and
> relocation will reset it back to a good state. Alternatives like another
> state or error bits would need tracking them and increase complexity.
>
> > Having proper zone_capacity is necessary for the change. So, set it as fast
> > as possible.
> >
> > We cannot handle RAID0 and RAID10 case like this. But, they are anyway
> > unable to read because of a missing stripe.
> >
> > Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
> > Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
> > CC: stable@vger.kernel.org # 6.1+
> > Reported-by: HAN Yuwei <hrx@bupt.moe>
> > Cc: Xuefer <xuefer@gmail.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
> Reviewed-by: David Sterba <dsterba@suse.com>
>
> > @@ -1650,6 +1653,23 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> >               goto out;
> >       }
> >
> > +     if (ret == -EIO && profile != 0 && profile != BTRFS_BLOCK_GROUP_RAID0 &&
> > +         profile != BTRFS_BLOCK_GROUP_RAID10) {
> > +             /*
> > +              * Detected broken write pointer.  Make this block group
> > +              * unallocatable by setting the allocation pointer at the end of
> > +              * allocatable region. Relocating this block group will fix the
> > +              * mismatch.
> > +              *
> > +              * Currently, we cannot handle RAID0 or RAID10 case like this
> > +              * because we don't have a proper zone_capacity value. But,
> > +              * reading from this block group won't work anyway by a missing
> > +              * stripe.
> > +              */
> > +             cache->alloc_offset = cache->zone_capacity;
>
> Mabe print a warning, this looks like a condition that should be
> communicated back to the user.
indeed, and possible case by bug e.g.:
strace -f mkfs.btrfs -O zoned,raid-stripe-tree -L bak /dev/sdd -f
.....
[pid 12871] ioctl(3, BLKDISCARD, [0, 268435456]) = -1 EOPNOTSUPP
(Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [268435456, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [536870912, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [805306368, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [1073741824, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [1342177280, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [1610612736, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [1879048192, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [2147483648, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [2415919104, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [2684354560, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [2952790016, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [3221225472, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
[pid 12871] ioctl(3, BLKDISCARD, [3489660928, 268435456]) = -1
EOPNOTSUPP (Operation not supported)
Conventional zones are not reset/cleared, the emulated pointers are not cleared

Naohiro Aota Sept. 4, 2024, 12:31 a.m. UTC | #6

On Mon, Sep 02, 2024 at 10:31:27PM GMT, David Sterba wrote:
> On Sat, Aug 31, 2024 at 01:32:49AM +0900, Naohiro Aota wrote:
> > Btrfs rejects to mount a FS if it finds a block group with a broken write
> > pointer (e.g, unequal write pointers on two zones of RAID1 block group).
> > Since such case can happen easily with a power-loss or crash of a system,
> > we need to handle the case more gently.
> > 
> > Handle such block group by making it unallocatable, so that there will be
> > no writes into it. That can be done by setting the allocation pointer at
> > the end of allocating region (= block_group->zone_capacity). Then, existing
> > code handle zone_unusable properly.
> 
> This sounds like the best option, this makes the zone read-only and
> relocation will reset it back to a good state. Alternatives like another
> state or error bits would need tracking them and increase complexity.
> 
> > Having proper zone_capacity is necessary for the change. So, set it as fast
> > as possible.
> > 
> > We cannot handle RAID0 and RAID10 case like this. But, they are anyway
> > unable to read because of a missing stripe.
> > 
> > Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
> > Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
> > CC: stable@vger.kernel.org # 6.1+
> > Reported-by: HAN Yuwei <hrx@bupt.moe>
> > Cc: Xuefer <xuefer@gmail.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> Reviewed-by: David Sterba <dsterba@suse.com>
> 
> > @@ -1650,6 +1653,23 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> >  		goto out;
> >  	}
> >  
> > +	if (ret == -EIO && profile != 0 && profile != BTRFS_BLOCK_GROUP_RAID0 &&
> > +	    profile != BTRFS_BLOCK_GROUP_RAID10) {
> > +		/*
> > +		 * Detected broken write pointer.  Make this block group
> > +		 * unallocatable by setting the allocation pointer at the end of
> > +		 * allocatable region. Relocating this block group will fix the
> > +		 * mismatch.
> > +		 *
> > +		 * Currently, we cannot handle RAID0 or RAID10 case like this
> > +		 * because we don't have a proper zone_capacity value. But,
> > +		 * reading from this block group won't work anyway by a missing
> > +		 * stripe.
> > +		 */
> > +		cache->alloc_offset = cache->zone_capacity;
> 
> Mabe print a warning, this looks like a condition that should be
> communicated back to the user.

It already prints "zoned: write pointer offset mismatch of zones in %s
profile". But, I'll add a message to inform that this block group is set
read-only and btrfs balance is desired.

David Sterba Sept. 4, 2024, 6:14 p.m. UTC | #7

On Wed, Sep 04, 2024 at 12:31:34AM +0000, Naohiro Aota wrote:
> On Mon, Sep 02, 2024 at 10:31:27PM GMT, David Sterba wrote:
> > On Sat, Aug 31, 2024 at 01:32:49AM +0900, Naohiro Aota wrote:
> > > Btrfs rejects to mount a FS if it finds a block group with a broken write
> > > pointer (e.g, unequal write pointers on two zones of RAID1 block group).
> > > Since such case can happen easily with a power-loss or crash of a system,
> > > we need to handle the case more gently.
> > > 
> > > Handle such block group by making it unallocatable, so that there will be
> > > no writes into it. That can be done by setting the allocation pointer at
> > > the end of allocating region (= block_group->zone_capacity). Then, existing
> > > code handle zone_unusable properly.
> > 
> > This sounds like the best option, this makes the zone read-only and
> > relocation will reset it back to a good state. Alternatives like another
> > state or error bits would need tracking them and increase complexity.
> > 
> > > Having proper zone_capacity is necessary for the change. So, set it as fast
> > > as possible.
> > > 
> > > We cannot handle RAID0 and RAID10 case like this. But, they are anyway
> > > unable to read because of a missing stripe.
> > > 
> > > Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
> > > Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
> > > CC: stable@vger.kernel.org # 6.1+
> > > Reported-by: HAN Yuwei <hrx@bupt.moe>
> > > Cc: Xuefer <xuefer@gmail.com>
> > > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > 
> > Reviewed-by: David Sterba <dsterba@suse.com>
> > 
> > > @@ -1650,6 +1653,23 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
> > >  		goto out;
> > >  	}
> > >  
> > > +	if (ret == -EIO && profile != 0 && profile != BTRFS_BLOCK_GROUP_RAID0 &&
> > > +	    profile != BTRFS_BLOCK_GROUP_RAID10) {
> > > +		/*
> > > +		 * Detected broken write pointer.  Make this block group
> > > +		 * unallocatable by setting the allocation pointer at the end of
> > > +		 * allocatable region. Relocating this block group will fix the
> > > +		 * mismatch.
> > > +		 *
> > > +		 * Currently, we cannot handle RAID0 or RAID10 case like this
> > > +		 * because we don't have a proper zone_capacity value. But,
> > > +		 * reading from this block group won't work anyway by a missing
> > > +		 * stripe.
> > > +		 */
> > > +		cache->alloc_offset = cache->zone_capacity;
> > 
> > Mabe print a warning, this looks like a condition that should be
> > communicated back to the user.
> 
> It already prints "zoned: write pointer offset mismatch of zones in %s
> profile". But, I'll add a message to inform that this block group is set
> read-only and btrfs balance is desired.

Ok good that there's another message. I've forwarded the unchanged patch
to master branch so if we really want the additional message please send
it as followup patch. Thanks.

btrfs: zoned: handle broken write pointer on zones

Commit Message

Comments

Patch