diff mbox series

[v4,08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets

Message ID 20220516165416.171196-9-p.raghav@samsung.com (mailing list archive)
State New, archived
Headers show
Series [v4,01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze | expand

Commit Message

Pankaj Raghav May 16, 2022, 4:54 p.m. UTC
Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
These are fixed at these locations so that recovery tools can reliably
retrieve the superblocks even if one of the mirror gets corrupted.

power of 2 zone sizes align at these offsets irrespective of their
value but non power of 2 zone sizes will not align.

To make sure the first zone at mirror 1 and mirror 2 align, write zero
operation is performed to move the write pointer of the first zone to
the expected offset. This operation is performed only after a zone reset
of the first zone, i.e., when the second zone that contains the sb is FULL.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 63 insertions(+), 5 deletions(-)

Comments

Johannes Thumshirn May 17, 2022, 6:50 a.m. UTC | #1
On 16/05/2022 18:55, Pankaj Raghav wrote:
> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
> These are fixed at these locations so that recovery tools can reliably
> retrieve the superblocks even if one of the mirror gets corrupted.
> 
> power of 2 zone sizes align at these offsets irrespective of their
> value but non power of 2 zone sizes will not align.
> 
> To make sure the first zone at mirror 1 and mirror 2 align, write zero
> operation is performed to move the write pointer of the first zone to
> the expected offset. This operation is performed only after a zone reset
> of the first zone, i.e., when the second zone that contains the sb is FULL.

Hi Pankaj, stupid question. Npo2 devices still have a zone size being a 
multiple of 4k don't they?

If not, we'd need to also have a tail padding of the superblock zones, in order
to move the WP of these zones to the end, so the sb-log states match up.
Pankaj Raghav May 17, 2022, 8 a.m. UTC | #2
On 2022-05-17 08:50, Johannes Thumshirn wrote:
> On 16/05/2022 18:55, Pankaj Raghav wrote:
>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
>> These are fixed at these locations so that recovery tools can reliably
>> retrieve the superblocks even if one of the mirror gets corrupted.
>>
>> power of 2 zone sizes align at these offsets irrespective of their
>> value but non power of 2 zone sizes will not align.
>>
>> To make sure the first zone at mirror 1 and mirror 2 align, write zero
>> operation is performed to move the write pointer of the first zone to
>> the expected offset. This operation is performed only after a zone reset
>> of the first zone, i.e., when the second zone that contains the sb is FULL.
> 
> Hi Pankaj, stupid question. Npo2 devices still have a zone size being a 
> multiple of 4k don't they?
> 
> If not, we'd need to also have a tail padding of the superblock zones, in order
> to move the WP of these zones to the end, so the sb-log states match up.
Hi Johannes,
NPO2 zoned devices has a minimum alignment requirement of 1MB. See
commit: `btrfs: zoned: relax the alignment constraint for zoned devices`

It will naturally align to 4k. So I don't think we need special handling
there with tail padding.
David Sterba May 17, 2022, 12:42 p.m. UTC | #3
On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
> These are fixed at these locations so that recovery tools can reliably
> retrieve the superblocks even if one of the mirror gets corrupted.
> 
> power of 2 zone sizes align at these offsets irrespective of their
> value but non power of 2 zone sizes will not align.
> 
> To make sure the first zone at mirror 1 and mirror 2 align, write zero
> operation is performed to move the write pointer of the first zone to
> the expected offset. This operation is performed only after a zone reset
> of the first zone, i.e., when the second zone that contains the sb is FULL.

Is it a good idea to do the "write zeros", instead of a plain "set write
pointer"? I assume setting write pointer is instant, while writing
potentially hundreds of megabytes may take significiant time. As the
functions may be called from random contexts, the increased time may
become a problem.

> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 63 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 3023c871e..805aeaa76 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>  	return 0;
>  }
>  
> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
> +			     int mirror, u64 *wp_ret)
> +{
> +	u64 offset = 0;
> +	int ret = 0;
> +
> +	ASSERT(!is_power_of_two_u64(zone->len));
> +	ASSERT(zone->wp == zone->start);
> +	ASSERT(mirror != 0);

This could simply accept 0 as the mirror offset too, the calculation is
trivial.

> +
> +	switch (mirror) {
> +	case 1:
> +		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	case 2:
> +		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	}
> +
> +	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
> +	if (ret)
> +		return ret;
> +
> +	zone->wp += offset;
> +	zone->cond = BLK_ZONE_COND_IMP_OPEN;
> +	*wp_ret = zone->wp << SECTOR_SHIFT;
> +
> +	return 0;
> +}
> +
>  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
> -			   int rw, u64 *bytenr_ret)
> +			   int rw, int mirror, u64 *bytenr_ret)
>  {
>  	u64 wp;
>  	int ret;
> +	bool zones_empty = false;
>  
>  	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>  		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  	if (ret != -ENOENT && ret < 0)
>  		return ret;
>  
> +	if (ret == -ENOENT)
> +		zones_empty = true;
> +
>  	if (rw == WRITE) {
>  		struct blk_zone *reset = NULL;
> +		bool is_sb_offset_write_req = false;
> +		u32 reset_zone_nr = -1;
>  
> -		if (wp == zones[0].start << SECTOR_SHIFT)
> +		if (wp == zones[0].start << SECTOR_SHIFT) {
>  			reset = &zones[0];
> -		else if (wp == zones[1].start << SECTOR_SHIFT)
> +			reset_zone_nr = 0;
> +		} else if (wp == zones[1].start << SECTOR_SHIFT) {
>  			reset = &zones[1];
> +			reset_zone_nr = 1;
> +		}
> +
> +		/*
> +		 * Non po2 zone sizes will not align naturally at
> +		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
> +		 * 1st zone in those superblock mirrors need to be
> +		 * moved to align at those offsets.
> +		 */

Please move this comment to the helper fill_sb_wp_offset itself, there
it's more discoverable.

> +		is_sb_offset_write_req =
> +			(zones_empty || (reset_zone_nr == 0)) && mirror &&
> +			!is_power_of_2(zones[0].len);

Accepting 0 as the mirror number would also get rid of this wild
expression substituting and 'if'.

>  
>  		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
>  			ASSERT(sb_zone_is_full(reset));
> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  			reset->cond = BLK_ZONE_COND_EMPTY;
>  			reset->wp = reset->start;
>  		}
> +
> +		if (is_sb_offset_write_req) {

And get rid of the conditional. The point of supporting both po2 and
nonpo2 is to hide any implementation details to wrappers as much as
possible.

> +			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
> +			if (ret)
> +				return ret;
> +		}
> +
>  	} else if (ret != -ENOENT) {
>  		/*
>  		 * For READ, we want the previous one. Move write pointer to
Pankaj Raghav May 18, 2022, 9:15 a.m. UTC | #4
On 2022-05-17 14:42, David Sterba wrote:
> On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
>> These are fixed at these locations so that recovery tools can reliably
>> retrieve the superblocks even if one of the mirror gets corrupted.
>>
>> power of 2 zone sizes align at these offsets irrespective of their
>> value but non power of 2 zone sizes will not align.
>>
>> To make sure the first zone at mirror 1 and mirror 2 align, write zero
>> operation is performed to move the write pointer of the first zone to
>> the expected offset. This operation is performed only after a zone reset
>> of the first zone, i.e., when the second zone that contains the sb is FULL.
> 
> Is it a good idea to do the "write zeros", instead of a plain "set write
> pointer"? I assume setting write pointer is instant, while writing
> potentially hundreds of megabytes may take significiant time. As the
> functions may be called from random contexts, the increased time may
> become a problem.
> 
Unfortunately it is not possible to just move the WP in zoned devices.
The only alternative that I could use is to do write zeroes which are
natively supported by some devices such as ZNS. It would be nice to know
if someone had a better solution to this instead of doing write zeroes
in zoned devices.

>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>  fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 63 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>> index 3023c871e..805aeaa76 100644
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>>  	return 0;
>>  }
>>  
>> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
>> +			     int mirror, u64 *wp_ret)
>> +{
>> +	u64 offset = 0;
>> +	int ret = 0;
>> +
>> +	ASSERT(!is_power_of_two_u64(zone->len));
>> +	ASSERT(zone->wp == zone->start);
>> +	ASSERT(mirror != 0);
> 
> This could simply accept 0 as the mirror offset too, the calculation is
> trivial.
> 
Ok. I will fix it up!
>> +
>> +	switch (mirror) {
>> +	case 1:
>> +		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
>> +			      zone->len, &offset);
>> +		break;
>> +	case 2:
>> +		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
>> +			      zone->len, &offset);
>> +		break;
>> +	}
>> +
>> +	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
>> +	if (ret)
>> +		return ret;
>> +
>> +		/*
>> +		 * Non po2 zone sizes will not align naturally at
>> +		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
>> +		 * 1st zone in those superblock mirrors need to be
>> +		 * moved to align at those offsets.
>> +		 */
> 
> Please move this comment to the helper fill_sb_wp_offset itself, there
> it's more discoverable.
> 
Ok.
>> +		is_sb_offset_write_req =
>> +			(zones_empty || (reset_zone_nr == 0)) && mirror &&
>> +			!is_power_of_2(zones[0].len);
> 
> Accepting 0 as the mirror number would also get rid of this wild
> expression substituting and 'if'.
> 
>>  
>>  		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
>>  			ASSERT(sb_zone_is_full(reset));
>> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>>  			reset->cond = BLK_ZONE_COND_EMPTY;
>>  			reset->wp = reset->start;
>>  		}
>> +
>> +		if (is_sb_offset_write_req) {
> 
> And get rid of the conditional. The point of supporting both po2 and
> nonpo2 is to hide any implementation details to wrappers as much as
> possible.
> 
Alright. I will move the logic to the wrapper instead of having the
conditional in this function.
>> +			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +
>>  	} else if (ret != -ENOENT) {
>>  		/*
>>  		 * For READ, we want the previous one. Move write pointer to
Thanks for your comments.
Johannes Thumshirn May 19, 2022, 7:57 a.m. UTC | #5
On 18/05/2022 11:17, Pankaj Raghav wrote:
> On 2022-05-17 14:42, David Sterba wrote:
>> On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
>>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
>>> These are fixed at these locations so that recovery tools can reliably
>>> retrieve the superblocks even if one of the mirror gets corrupted.
>>>
>>> power of 2 zone sizes align at these offsets irrespective of their
>>> value but non power of 2 zone sizes will not align.
>>>
>>> To make sure the first zone at mirror 1 and mirror 2 align, write zero
>>> operation is performed to move the write pointer of the first zone to
>>> the expected offset. This operation is performed only after a zone reset
>>> of the first zone, i.e., when the second zone that contains the sb is FULL.
>> Is it a good idea to do the "write zeros", instead of a plain "set write
>> pointer"? I assume setting write pointer is instant, while writing
>> potentially hundreds of megabytes may take significiant time. As the
>> functions may be called from random contexts, the increased time may
>> become a problem.
>>
> Unfortunately it is not possible to just move the WP in zoned devices.
> The only alternative that I could use is to do write zeroes which are
> natively supported by some devices such as ZNS. It would be nice to know
> if someone had a better solution to this instead of doing write zeroes
> in zoned devices.
> 

I have another question. In case we need to pad the sb zone with a write
zeros and have a power fail between the write-zeros and the regular 
super-block write, what happens? I know this padding is only done for the
backup super blocks, never the less it can happen and it can happen when
the primary super block is also corrupted.

AFAIU we're then trying to reach out for a backup super block, look at the
write pointer and it only contains zeros but no super block, as only the 
write-zeros has reached the device and not the super block write.

How is this situation handled?

Thanks,
	Johannes
Naohiro Aota May 19, 2022, 7:59 a.m. UTC | #6
On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
> These are fixed at these locations so that recovery tools can reliably
> retrieve the superblocks even if one of the mirror gets corrupted.
> 
> power of 2 zone sizes align at these offsets irrespective of their
> value but non power of 2 zone sizes will not align.
> 
> To make sure the first zone at mirror 1 and mirror 2 align, write zero
> operation is performed to move the write pointer of the first zone to
> the expected offset. This operation is performed only after a zone reset
> of the first zone, i.e., when the second zone that contains the sb is FULL.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 63 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 3023c871e..805aeaa76 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>  	return 0;
>  }
>  
> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
> +			     int mirror, u64 *wp_ret)
> +{
> +	u64 offset = 0;
> +	int ret = 0;
> +
> +	ASSERT(!is_power_of_two_u64(zone->len));
> +	ASSERT(zone->wp == zone->start);
> +	ASSERT(mirror != 0);
> +
> +	switch (mirror) {
> +	case 1:
> +		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	case 2:
> +		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	}
> +
> +	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
> +	if (ret)
> +		return ret;
> +
> +	zone->wp += offset;
> +	zone->cond = BLK_ZONE_COND_IMP_OPEN;
> +	*wp_ret = zone->wp << SECTOR_SHIFT;
> +
> +	return 0;
> +}
> +
>  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
> -			   int rw, u64 *bytenr_ret)
> +			   int rw, int mirror, u64 *bytenr_ret)
>  {
>  	u64 wp;
>  	int ret;
> +	bool zones_empty = false;
>  
>  	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>  		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  	if (ret != -ENOENT && ret < 0)
>  		return ret;
>  
> +	if (ret == -ENOENT)
> +		zones_empty = true;
> +

I think, we don't need this. We need to issue the zeroout when
zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) after sending
ZONE_RESET if necessary. No?

>  	if (rw == WRITE) {
>  		struct blk_zone *reset = NULL;
> +		bool is_sb_offset_write_req = false;
> +		u32 reset_zone_nr = -1;
>  
> -		if (wp == zones[0].start << SECTOR_SHIFT)
> +		if (wp == zones[0].start << SECTOR_SHIFT) {
>  			reset = &zones[0];
> -		else if (wp == zones[1].start << SECTOR_SHIFT)
> +			reset_zone_nr = 0;
> +		} else if (wp == zones[1].start << SECTOR_SHIFT) {
>  			reset = &zones[1];
> +			reset_zone_nr = 1;
> +		}
> +
> +		/*
> +		 * Non po2 zone sizes will not align naturally at
> +		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
> +		 * 1st zone in those superblock mirrors need to be
> +		 * moved to align at those offsets.
> +		 */
> +		is_sb_offset_write_req =
> +			(zones_empty || (reset_zone_nr == 0)) && mirror &&
> +			!is_power_of_2(zones[0].len);
>  
>  		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
>  			ASSERT(sb_zone_is_full(reset));
> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  			reset->cond = BLK_ZONE_COND_EMPTY;
>  			reset->wp = reset->start;
>  		}
> +
> +		if (is_sb_offset_write_req) {
> +			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
> +			if (ret)
> +				return ret;
> +		}
> +
>  	} else if (ret != -ENOENT) {
>  		/*
>  		 * For READ, we want the previous one. Move write pointer to
> @@ -851,7 +909,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
>  	if (ret != BTRFS_NR_SB_LOG_ZONES)
>  		return -EIO;
>  
> -	return sb_log_location(bdev, zones, rw, bytenr_ret);
> +	return sb_log_location(bdev, zones, rw, mirror, bytenr_ret);
>  }
>  
>  int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
> @@ -877,7 +935,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
>  
>  	return sb_log_location(device->bdev,
>  			       &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror],
> -			       rw, bytenr_ret);
> +			       rw, mirror, bytenr_ret);
>  }
>  
>  static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
> -- 
> 2.25.1
>
Pankaj Raghav May 20, 2022, 9:06 a.m. UTC | #7
On 5/19/22 09:57, Johannes Thumshirn wrote:
>> Unfortunately it is not possible to just move the WP in zoned devices.
>> The only alternative that I could use is to do write zeroes which are
>> natively supported by some devices such as ZNS. It would be nice to know
>> if someone had a better solution to this instead of doing write zeroes
>> in zoned devices.
>>
> 
> I have another question. In case we need to pad the sb zone with a write
> zeros and have a power fail between the write-zeros and the regular 
> super-block write, what happens? I know this padding is only done for the
> backup super blocks, never the less it can happen and it can happen when
> the primary super block is also corrupted.
> 
> AFAIU we're then trying to reach out for a backup super block, look at the
> write pointer and it only contains zeros but no super block, as only the 
> write-zeros has reached the device and not the super block write.
> 
> How is this situation handled?
> 
That is a very good point. I did think about this situation while adding
padding to the mirror superblock with write zeroes. If the drive is
**less than 4TB** and with the **primary superblock corrupted**, then it
will be an issue with the situation you have described for npo2 drives.
That situation is not handled here. Ofc this is not an issue when we
have the second mirror at 4TB for bigger drives. Do you have some ideas
in mind for this failure mode?
> Thanks,
> 	Johannes
Pankaj Raghav May 20, 2022, 9:09 a.m. UTC | #8
On 5/19/22 09:59, Naohiro Aota wrote:
>> +
>>  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>> -			   int rw, u64 *bytenr_ret)
>> +			   int rw, int mirror, u64 *bytenr_ret)
>>  {
>>  	u64 wp;
>>  	int ret;
>> +	bool zones_empty = false;
>>  
>>  	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>>  		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
>> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>>  	if (ret != -ENOENT && ret < 0)
>>  		return ret;
>>  
>> +	if (ret == -ENOENT)
>> +		zones_empty = true;
>> +
> 
> I think, we don't need this. We need to issue the zeroout when
> zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) after sending
> ZONE_RESET if necessary. No?
> 
Yeah. I added this extra check to cover all the cases possible. But you
are right that it is enough to issue zeroout only for this condition:
zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) as both the
zones empty is not likely to happen.
>>  	if (rw == WRITE) {
>>  		struct blk_zone *reset = NULL;
>> +		bool is_sb_offset_write_req = false;
>> +		u32 reset_zone_nr = -1;
>>  
>> -		if (wp == zones[0].start << SECTOR_SHIFT)
>> +		if (wp == zones[0].start << SECTOR_SHIFT) {
>>  			reset = &zones[0];
>> -		else if (wp == zones[1].start << SECTOR_SHIFT)
>> +			reset_zone_nr = 0;
Johannes Thumshirn May 20, 2022, 9:15 a.m. UTC | #9
On 20/05/2022 11:07, Pankaj Raghav wrote:
> On 5/19/22 09:57, Johannes Thumshirn wrote:
>>> Unfortunately it is not possible to just move the WP in zoned devices.
>>> The only alternative that I could use is to do write zeroes which are
>>> natively supported by some devices such as ZNS. It would be nice to know
>>> if someone had a better solution to this instead of doing write zeroes
>>> in zoned devices.
>>>
>>
>> I have another question. In case we need to pad the sb zone with a write
>> zeros and have a power fail between the write-zeros and the regular 
>> super-block write, what happens? I know this padding is only done for the
>> backup super blocks, never the less it can happen and it can happen when
>> the primary super block is also corrupted.
>>
>> AFAIU we're then trying to reach out for a backup super block, look at the
>> write pointer and it only contains zeros but no super block, as only the 
>> write-zeros has reached the device and not the super block write.
>>
>> How is this situation handled?
>>
> That is a very good point. I did think about this situation while adding
> padding to the mirror superblock with write zeroes. If the drive is
> **less than 4TB** and with the **primary superblock corrupted**, then it
> will be an issue with the situation you have described for npo2 drives.
> That situation is not handled here. Ofc this is not an issue when we
> have the second mirror at 4TB for bigger drives. Do you have some ideas
> in mind for this failure mode?

The only idea I have for this is creating a bounce buffer, write the padding
and the super-block into the buffer and then submit it. But that's too ugly
to live.

And it would involve changing non-zoned super-block writing code, which I think
is way to risky.
diff mbox series

Patch

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 3023c871e..805aeaa76 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -760,11 +760,44 @@  int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 	return 0;
 }
 
+static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
+			     int mirror, u64 *wp_ret)
+{
+	u64 offset = 0;
+	int ret = 0;
+
+	ASSERT(!is_power_of_two_u64(zone->len));
+	ASSERT(zone->wp == zone->start);
+	ASSERT(mirror != 0);
+
+	switch (mirror) {
+	case 1:
+		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
+			      zone->len, &offset);
+		break;
+	case 2:
+		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
+			      zone->len, &offset);
+		break;
+	}
+
+	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
+	if (ret)
+		return ret;
+
+	zone->wp += offset;
+	zone->cond = BLK_ZONE_COND_IMP_OPEN;
+	*wp_ret = zone->wp << SECTOR_SHIFT;
+
+	return 0;
+}
+
 static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
-			   int rw, u64 *bytenr_ret)
+			   int rw, int mirror, u64 *bytenr_ret)
 {
 	u64 wp;
 	int ret;
+	bool zones_empty = false;
 
 	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
 		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
@@ -775,13 +808,31 @@  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
 	if (ret != -ENOENT && ret < 0)
 		return ret;
 
+	if (ret == -ENOENT)
+		zones_empty = true;
+
 	if (rw == WRITE) {
 		struct blk_zone *reset = NULL;
+		bool is_sb_offset_write_req = false;
+		u32 reset_zone_nr = -1;
 
-		if (wp == zones[0].start << SECTOR_SHIFT)
+		if (wp == zones[0].start << SECTOR_SHIFT) {
 			reset = &zones[0];
-		else if (wp == zones[1].start << SECTOR_SHIFT)
+			reset_zone_nr = 0;
+		} else if (wp == zones[1].start << SECTOR_SHIFT) {
 			reset = &zones[1];
+			reset_zone_nr = 1;
+		}
+
+		/*
+		 * Non po2 zone sizes will not align naturally at
+		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
+		 * 1st zone in those superblock mirrors need to be
+		 * moved to align at those offsets.
+		 */
+		is_sb_offset_write_req =
+			(zones_empty || (reset_zone_nr == 0)) && mirror &&
+			!is_power_of_2(zones[0].len);
 
 		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
 			ASSERT(sb_zone_is_full(reset));
@@ -795,6 +846,13 @@  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
 			reset->cond = BLK_ZONE_COND_EMPTY;
 			reset->wp = reset->start;
 		}
+
+		if (is_sb_offset_write_req) {
+			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
+			if (ret)
+				return ret;
+		}
+
 	} else if (ret != -ENOENT) {
 		/*
 		 * For READ, we want the previous one. Move write pointer to
@@ -851,7 +909,7 @@  int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 	if (ret != BTRFS_NR_SB_LOG_ZONES)
 		return -EIO;
 
-	return sb_log_location(bdev, zones, rw, bytenr_ret);
+	return sb_log_location(bdev, zones, rw, mirror, bytenr_ret);
 }
 
 int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
@@ -877,7 +935,7 @@  int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
 
 	return sb_log_location(device->bdev,
 			       &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror],
-			       rw, bytenr_ret);
+			       rw, mirror, bytenr_ret);
 }
 
 static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,