
[v2,2/3] btrfs: zoned: move superblock logging zone location

Message ID fe07f3ca7b17b6739cff8ab228d57bdbea0c447b.1614760899.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series: Fixes for zoned mode

Commit Message

Naohiro Aota March 3, 2021, 8:55 a.m. UTC
This commit moves the location of the superblock logging zones to be based
on fixed addresses instead of fixed zone numbers.

By locating the superblock log zones at fixed addresses, we can scan a
dumped file system image without the zone information. There are no
drawbacks to this.

The following zones are reserved as the circular buffer on zoned btrfs.
  - The primary superblock: zone at LBA 0 and the next zone
  - The first copy: zone at LBA 16G and the next zone
  - The second copy: zone at LBA 256G and the next zone

If the location of a zone is outside of the disk, we don't record that
superblock copy.

These addresses are much larger than the usual superblock copy locations.
They are chosen to support a possible future larger zone size without the
log zones overlapping. We support zone sizes up to 8GB.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

Comments

David Sterba March 4, 2021, 3:14 p.m. UTC | #1
On Wed, Mar 03, 2021 at 05:55:47PM +0900, Naohiro Aota wrote:
> This commit moves the location of the superblock logging zones to be based
> on fixed addresses instead of fixed zone numbers.
> 
> By locating the superblock log zones at fixed addresses, we can scan a
> dumped file system image without the zone information. There are no
> drawbacks to this.
> 
> The following zones are reserved as the circular buffer on zoned btrfs.
>   - The primary superblock: zone at LBA 0 and the next zone
>   - The first copy: zone at LBA 16G and the next zone
>   - The second copy: zone at LBA 256G and the next zone
> 
> If the location of a zone is outside of the disk, we don't record that
> superblock copy.
> 
> These addresses are much larger than the usual superblock copy locations.
> They are chosen to support a possible future larger zone size without the
> log zones overlapping. We support zone sizes up to 8GB.

One thing I don't see addressed is that the reserved space for the
superblock is fixed regardless of the actual device zone size, in
exclude_super_stripes.

0-16G for primary
... and now what? 16G would be the next copy, thus reserving 16G up to 32G.

So the 64G offset for the 1st copy is more suitable:

0    -  16G primary
64G  -  80G 1st copy
256G - 272G 2nd copy

This still does not sound great because it just builds on the original
offsets from 10 years ago.  The device sizes are expected to be in
terabytes but all the superblocks are in the first terabyte.

What if we do that like

0   -  16G
1T  -  1T+16G
8T  -  8T+16G

HDD sizes start somewhere around 4T, so the first two copies cover the
small sizes and larger drives have all three copies. But we could go even
wilder, like 0/4T/16T.

I'm not sure if the capacities for non-HDD devices are also going to be
that large. I could not find anything specific; the only existing ZNS
device is some DC ZN540, but no details.

We need to get this right (best effort), so I'll postpone this patch
until it's all sorted.
Damien Le Moal March 4, 2021, 11 p.m. UTC | #2
On 2021/03/05 0:20, David Sterba wrote:
> On Wed, Mar 03, 2021 at 05:55:47PM +0900, Naohiro Aota wrote:
>> This commit moves the location of the superblock logging zones to be
>> based on fixed addresses instead of fixed zone numbers.
>>
>> By locating the superblock log zones at fixed addresses, we can scan a
>> dumped file system image without the zone information. There are no
>> drawbacks to this.
>>
>> The following zones are reserved as the circular buffer on zoned btrfs.
>>   - The primary superblock: zone at LBA 0 and the next zone
>>   - The first copy: zone at LBA 16G and the next zone
>>   - The second copy: zone at LBA 256G and the next zone
>>
>> If the location of a zone is outside of the disk, we don't record that
>> superblock copy.
>>
>> These addresses are much larger than the usual superblock copy locations.
>> They are chosen to support a possible future larger zone size without the
>> log zones overlapping. We support zone sizes up to 8GB.
> 
> One thing I don't see is that the reserved space for superblock is fixed
> regardless of the actual device zone size. In exclude_super_stripes.
> 
> 0-16G for primary
> ... and now what, 16G would be the next copy thus reserving 16 up to 32G
> 
> So the 64G offset for the 1st copy is more suitable:
> 
> 0    -  16G primary
> 64G  -  80G 1st copy
> 256G - 272G 2nd copy
> 
> This still does not sound great because it just builds on the original
> offsets from 10 years ago.  The device sizes are expected to be in
> terabytes but all the superblocks are in the first terabyte.

I do not see an issue with that. For HDDs, one would ideally want each copy
under a different head, but determining which head serves which LBA is not
possible with standard commands. LBAs are generally distributed first across
one head (platter side) for one or more zones, then continue on the next head
backward (other side of the same platter), and on to the following
head/platter. So the distribution is first vertical, then goes inward (and
when reaching the middle of the platter, everything starts again from the
spindle outward).

0/64G/256G likely gives you different heads. No way to tell for certain though.

> What if we do that like
> 
> 0   -  16G
> 1T  -  1T+16G
> 8T  -  8T+16G
> 
> The HDD sizes start somewhere at 4T so the first two copies cover the
> small sizes, larger have all three copies. But we could go wild even
> more, like 0/4T/16T.

That would work for HDDs. We are at 20T with SMR now, and the lowest SMR
capacity is 14T. For regular disks, yes, 4T is kind of the starting point
for enterprise drives. Consumer/NAS drives can start lower though, at 1T or
2T. To be able to cover all cases nicely, I would suggest not exceeding 1T
for the SB copies.

> I'm not sure if the capacities for non-HDD are going to be also that
> large, I could not find anything specific, the only existing ZNS is some
> DC ZN540 but no details.

That one is sampling at 2T capacity now. This will likely be the lower
boundary, and higher capacities will become available. Not sure yet up to
what point; likely, different models at 4T, 6T, 8T, 16T... will be
available. So it's kind of the same story as for HDDs. Keeping the SB
copies within the first TB will allow supporting all models.

So I kind of like your initial suggestion:

0    -  16G primary
64G  -  80G 1st copy
256G - 272G 2nd copy

And we could even do:

0    -  16G primary
128G - 160G 1st copy
512G - 544G 2nd copy

Which would also safely allow zone sizes larger than 8G for ZNS too. (I do
not think this will happen anytime soon, but with these values we are
safer.)

> 
> We need to get this right (best effort), so I'll postpone this patch
> until it's all sorted.
>

Patch

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1324bb6c3946..b8f50dc9fbb0 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -24,6 +24,15 @@ 
 /* Number of superblock log zones */
 #define BTRFS_NR_SB_LOG_ZONES 2
 
+/* Location of superblock log zones */
+#define BTRFS_FIRST_SB_LOG_ZONE SZ_16G
+#define BTRFS_SECOND_SB_LOG_ZONE (256ULL * SZ_1G)
+#define BTRFS_FIRST_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_FIRST_SB_LOG_ZONE)
+#define BTRFS_SECOND_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_SECOND_SB_LOG_ZONE)
+
+/* Max size of supported zone size */
+#define BTRFS_MAX_ZONE_SIZE SZ_8G
+
 static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void *data)
 {
 	struct blk_zone *zones = data;
@@ -112,10 +121,9 @@  static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
 
 /*
  * The following zones are reserved as the circular buffer on ZONED btrfs.
- *  - The primary superblock: zones 0 and 1
- *  - The first copy: zones 16 and 17
- *  - The second copy: zones 1024 or zone at 256GB which is minimum, and
- *                     the following one
+ *  - The primary superblock: zone at LBA 0 and the next zone
+ *  - The first copy: zone at LBA 16G and the next zone
+ *  - The second copy: zone at LBA 256G and the next zone
  */
 static inline u32 sb_zone_number(int shift, int mirror)
 {
@@ -123,8 +131,8 @@  static inline u32 sb_zone_number(int shift, int mirror)
 
 	switch (mirror) {
 	case 0: return 0;
-	case 1: return 16;
-	case 2: return min_t(u64, btrfs_sb_offset(mirror) >> shift, 1024);
+	case 1: return 1 << (BTRFS_FIRST_SB_LOG_ZONE_SHIFT - shift);
+	case 2: return 1 << (BTRFS_SECOND_SB_LOG_ZONE_SHIFT - shift);
 	}
 
 	return 0;
@@ -300,10 +308,25 @@  int btrfs_get_dev_zone_info(struct btrfs_device *device)
 		zone_sectors = bdev_zone_sectors(bdev);
 	}
 
-	nr_sectors = bdev_nr_sectors(bdev);
 	/* Check if it's power of 2 (see is_power_of_2) */
 	ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
 	zone_info->zone_size = zone_sectors << SECTOR_SHIFT;
+
+	/*
+	 * We must reject devices with a zone size larger than 8GB to avoid
+	 * overlap between the super block primary and first copy fixed log
+	 * locations.
+	 */
+	if (zone_info->zone_size > BTRFS_MAX_ZONE_SIZE) {
+		btrfs_err_in_rcu(fs_info,
+				 "zoned: %s: zone size %llu is too large",
+				 rcu_str_deref(device->name),
+				 zone_info->zone_size);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	nr_sectors = bdev_nr_sectors(bdev);
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
 	zone_info->max_zone_append_size =
 		(u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;