diff mbox series

[v2] btrfs: zoned: move superblock logging zone location

Message ID 2f58edb74695825632c77349b000d31f16cb3226.1617870145.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Headers show
Series [v2] btrfs: zoned: move superblock logging zone location | expand

Commit Message

Naohiro Aota April 8, 2021, 8:25 a.m. UTC
This commit moves the location of the superblock logging zones. The new
locations of the logging zones are now determined based on fixed block
addresses instead of on fixed zone numbers.

The old placement method based on fixed zone numbers causes problems when
one needs to inspect a file system image without access to the drive zone
information. In such case, the super block locations cannot be reliably
determined as the zone size is unknown. By locating the superblock logging
zones using fixed addresses, we can scan a dumped file system image without
the zone information since a super block copy will always be present at or
after the fixed location.

This commit introduces the following three pairs of zones containing fixed
offset locations, regardless of the device zone size.

  - Primary superblock: zone starting at offset 0 and the following zone
  - First copy: zone containing offset 64GB and the following zone
  - Second copy: zone containing offset 256GB and the following zone

If a logging zone is outside of the disk capacity, we do not record the
superblock copy.

The first copy position is much larger than for a regular btrfs volume
(64M).  This increase is to avoid overlapping with the log zones for the
primary superblock. This higher location is arbitrary but allows supporting
devices with very large zone sizes, up to 32GB. Such large zone size is
unrealistic and very unlikely to ever be seen in real devices. Currently,
SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be
in the 1-4GB range, so this 32GB limit gives us room to breathe. For now,
we only allow zone sizes up to 8GB, below this hard limit of 32GB.

The fixed location addresses are somewhat arbitrary, but with the intent of
maintaining superblock reliability even for smaller devices. For this
reason, the superblock fixed locations do not exceed 1TB.

The superblock logging zones are reserved for superblock logging and never
used for data or metadata blocks. Note that we only reserve the two zones
per primary/copy actually used for superblock logging. We do not reserve
the ranges of zones possibly containing superblocks with the largest
supported zone size (0-16GB, 64G-80GB, 256G-272GB).

The zones containing the fixed location offsets used to store superblocks
in a regular btrfs volume (no zoned case) are also reserved to avoid
confusion.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/zoned.c | 43 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 8 deletions(-)

Comments

Josef Bacik April 8, 2021, 2:57 p.m. UTC | #1
On 4/8/21 4:25 AM, Naohiro Aota wrote:
> This commit moves the location of the superblock logging zones. The new
> locations of the logging zones are now determined based on fixed block
> addresses instead of on fixed zone numbers.
> 
> The old placement method based on fixed zone numbers causes problems when
> one needs to inspect a file system image without access to the drive zone
> information. In such case, the super block locations cannot be reliably
> determined as the zone size is unknown. By locating the superblock logging
> zones using fixed addresses, we can scan a dumped file system image without
> the zone information since a super block copy will always be present at or
> after the fixed location.
> 
> This commit introduces the following three pairs of zones containing fixed
> offset locations, regardless of the device zone size.
> 
>    - Primary superblock: zone starting at offset 0 and the following zone
>    - First copy: zone containing offset 64GB and the following zone
>    - Second copy: zone containing offset 256GB and the following zone
> 
> If a logging zone is outside of the disk capacity, we do not record the
> superblock copy.
> 
> The first copy position is much larger than for a regular btrfs volume
> (64M).  This increase is to avoid overlapping with the log zones for the
> primary superblock. This higher location is arbitrary but allows supporting
> devices with very large zone sizes, up to 32GB. Such large zone size is
> unrealistic and very unlikely to ever be seen in real devices. Currently,
> SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be
> in the 1-4GB range, so this 32GB limit gives us room to breathe. For now,
> we only allow zone sizes up to 8GB, below this hard limit of 32GB.
> 
> The fixed location addresses are somewhat arbitrary, but with the intent of
> maintaining superblock reliability even for smaller devices. For this
> reason, the superblock fixed locations do not exceed 1TB.
> 
> The superblock logging zones are reserved for superblock logging and never
> used for data or metadata blocks. Note that we only reserve the two zones
> per primary/copy actually used for superblock logging. We do not reserve
> the ranges of zones possibly containing superblocks with the largest
> supported zone size (0-16GB, 64G-80GB, 256G-272GB).
> 
> The zones containing the fixed location offsets used to store superblocks
> in a regular btrfs volume (no zoned case) are also reserved to avoid
> confusion.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Thanks Naohiro, this makes it much easier to understand,

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Dave, I know you're on vacation, this needs to go to Linus for this cycle so we 
don't have two different SB slots for two different kernels.  I don't want to 
disturb your vacation, so I'll put this in a pull request tomorrow to Linus.  If 
you already had plans to do this let me know and I'll hold off.  Thanks,

Josef
David Sterba April 9, 2021, 10:48 a.m. UTC | #2
On Thu, Apr 08, 2021 at 10:57:32AM -0400, Josef Bacik wrote:
> On 4/8/21 4:25 AM, Naohiro Aota wrote:
> > confusion.
> > 
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> Thanks Naohiro, this makes it much easier to understand,
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 
> Dave, I know you're on vacation, this needs to go to Linus for this cycle so we 
> don't have two different SB slots for two different kernels.  I don't want to 
> disturb your vacation, so I'll put this in a pull request tomorrow to Linus.  If 
> you already had plans to do this let me know and I'll hold off.  Thanks,

We don't even have mkfs support for zoned filesystem merged so that the
format is not final was not an issue and that's why I did not hurry
pushing it in that cycle and I still rather not do that.

The design-level problem is the backup copy offset, that's been
discussed in some previous iteration - whether it's right to place all
copies under the first 1T or if we should spread them. For me this is
still not settled, I haven't read through all the replies and given it
enough thought.

That could lead to another change of the offsets that I'm sure we want
to avoid.
David Sterba April 9, 2021, 11:07 a.m. UTC | #3
On Thu, Apr 08, 2021 at 05:25:28PM +0900, Naohiro Aota wrote:
> This commit moves the location of the superblock logging zones. The new
> locations of the logging zones are now determined based on fixed block
> addresses instead of on fixed zone numbers.
> 
> The old placement method based on fixed zone numbers causes problems when
> one needs to inspect a file system image without access to the drive zone
> information. In such case, the super block locations cannot be reliably
> determined as the zone size is unknown. By locating the superblock logging
> zones using fixed addresses, we can scan a dumped file system image without
> the zone information since a super block copy will always be present at or
> after the fixed location.
> 
> This commit introduces the following three pairs of zones containing fixed
> offset locations, regardless of the device zone size.
> 
>   - Primary superblock: zone starting at offset 0 and the following zone
>   - First copy: zone containing offset 64GB and the following zone
>   - Second copy: zone containing offset 256GB and the following zone
> 
> If a logging zone is outside of the disk capacity, we do not record the
> superblock copy.
> 
> The first copy position is much larger than for a regular btrfs volume
> (64M).  This increase is to avoid overlapping with the log zones for the
> primary superblock. This higher location is arbitrary but allows supporting
> devices with very large zone sizes, up to 32GB. Such large zone size is
> unrealistic and very unlikely to ever be seen in real devices. Currently,
> SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be
> in the 1-4GB range, so this 32GB limit gives us room to breathe. For now,
> we only allow zone sizes up to 8GB, below this hard limit of 32GB.
> 
> The fixed location addresses are somewhat arbitrary, but with the intent of
> maintaining superblock reliability even for smaller devices. For this
> reason, the superblock fixed locations do not exceed 1TB.

Yeah, so how much should we adjust the offsets in favor of smaller
devices instead of larger ones? I think this is going the wrong
direction, the capacity will grow, the zones are supposed to be larger.

> The superblock logging zones are reserved for superblock logging and never
> used for data or metadata blocks. Note that we only reserve the two zones
> per primary/copy actually used for superblock logging. We do not reserve
> the ranges of zones possibly containing superblocks with the largest
> supported zone size (0-16GB, 64G-80GB, 256G-272GB).
> 
> The zones containing the fixed location offsets used to store superblocks
> in a regular btrfs volume (no zoned case) are also reserved to avoid
> confusion.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/zoned.c | 43 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 1f972b75a9ab..a4b195fe08a0 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -21,9 +21,28 @@
>  /* Pseudo write pointer value for conventional zone */
>  #define WP_CONVENTIONAL ((u64)-2)
>  
> +/*
> + * Location of the first zone of superblock logging zone pairs.
> + * - Primary superblock: the zone containing offset 0 (zone 0)
> + * - First superblock copy: the zone containing offset 64G
> + * - Second superblock copy: the zone containing offset 256G
> + */
> +#define BTRFS_PRIMARY_SB_LOG_ZONE 0ULL
> +#define BTRFS_FIRST_SB_LOG_ZONE (64ULL * SZ_1G)
> +#define BTRFS_SECOND_SB_LOG_ZONE (256ULL * SZ_1G)
> +#define BTRFS_FIRST_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_FIRST_SB_LOG_ZONE)
> +#define BTRFS_SECOND_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_SECOND_SB_LOG_ZONE)

This should be named like BTRFS_SB_LOG_* ie. "namespaces" or common
prefixes for the same class of valuesd.

> +
>  /* Number of superblock log zones */
>  #define BTRFS_NR_SB_LOG_ZONES 2
>  
> +/*
> + * Maximum size of zones. Currently, SMR disks have a zone size of 256MB,
> + * and we are expecting ZNS drives to be in the 1-4GB range. We do not
> + * expect the zone size to become larger than 8GB in the near future.
> + */
> +#define BTRFS_MAX_ZONE_SIZE SZ_8G
> +
>  static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void *data)
>  {
>  	struct blk_zone *zones = data;
> @@ -111,11 +130,8 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
>  }
>  
>  /*
> - * The following zones are reserved as the circular buffer on ZONED btrfs.
> - *  - The primary superblock: zones 0 and 1
> - *  - The first copy: zones 16 and 17
> - *  - The second copy: zones 1024 or zone at 256GB which is minimum, and
> - *                     the following one
> + * Get the zone number of the first zone of a pair of contiguous zones used
> + * for superblock logging.
>   */
>  static inline u32 sb_zone_number(int shift, int mirror)
>  {
> @@ -123,8 +139,8 @@ static inline u32 sb_zone_number(int shift, int mirror)
>  
>  	switch (mirror) {
>  	case 0: return 0;
> -	case 1: return 16;
> -	case 2: return min_t(u64, btrfs_sb_offset(mirror) >> shift, 1024);
> +	case 1: return 1 << (BTRFS_FIRST_SB_LOG_ZONE_SHIFT - shift);
> +	case 2: return 1 << (BTRFS_SECOND_SB_LOG_ZONE_SHIFT - shift);
>  	}
>  
>  	return 0;
> @@ -300,10 +316,21 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
>  		zone_sectors = bdev_zone_sectors(bdev);
>  	}
>  
> -	nr_sectors = bdev_nr_sectors(bdev);
>  	/* Check if it's power of 2 (see is_power_of_2) */
>  	ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
>  	zone_info->zone_size = zone_sectors << SECTOR_SHIFT;
> +
> +	/* We reject devices with a zone size larger than 8GB. */
> +	if (zone_info->zone_size > BTRFS_MAX_ZONE_SIZE) {
> +		btrfs_err_in_rcu(fs_info,
> +				 "zoned: %s: zone size %llu is too large",
> +				 rcu_str_deref(device->name),
> +				 zone_info->zone_size);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	nr_sectors = bdev_nr_sectors(bdev);
>  	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
>  	zone_info->max_zone_append_size =
>  		(u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;
> -- 
> 2.31.1
David Sterba April 10, 2021, 9:29 a.m. UTC | #4
On Fri, Apr 09, 2021 at 01:07:19PM +0200, David Sterba wrote:
> On Thu, Apr 08, 2021 at 05:25:28PM +0900, Naohiro Aota wrote:
> > This commit moves the location of the superblock logging zones. The new
> > locations of the logging zones are now determined based on fixed block
> > addresses instead of on fixed zone numbers.
> > 
> > The old placement method based on fixed zone numbers causes problems when
> > one needs to inspect a file system image without access to the drive zone
> > information. In such case, the super block locations cannot be reliably
> > determined as the zone size is unknown. By locating the superblock logging
> > zones using fixed addresses, we can scan a dumped file system image without
> > the zone information since a super block copy will always be present at or
> > after the fixed location.
> > 
> > This commit introduces the following three pairs of zones containing fixed
> > offset locations, regardless of the device zone size.
> > 
> >   - Primary superblock: zone starting at offset 0 and the following zone
> >   - First copy: zone containing offset 64GB and the following zone
> >   - Second copy: zone containing offset 256GB and the following zone
> > 
> > If a logging zone is outside of the disk capacity, we do not record the
> > superblock copy.
> > 
> > The first copy position is much larger than for a regular btrfs volume
> > (64M).  This increase is to avoid overlapping with the log zones for the
> > primary superblock. This higher location is arbitrary but allows supporting
> > devices with very large zone sizes, up to 32GB. Such large zone size is
> > unrealistic and very unlikely to ever be seen in real devices. Currently,
> > SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be
> > in the 1-4GB range, so this 32GB limit gives us room to breathe. For now,
> > we only allow zone sizes up to 8GB, below this hard limit of 32GB.
> > 
> > The fixed location addresses are somewhat arbitrary, but with the intent of
> > maintaining superblock reliability even for smaller devices. For this
> > reason, the superblock fixed locations do not exceed 1TB.
> 
> Yeah, so how much should we adjust the offsets in favor of smaller
> devices instead of larger ones? I think this is going the wrong
> direction, the capacity will grow, the zones are supposed to be larger.

For the record, we had a group discussion about that and the consensus
is to do sb offsets at 0, 512G and 4T.

The rationale is to prefer large devices slightly more than smaller, but
are 2 superblocks available under 1T to cover small devices. Physical
devices' capacity grows and the 4T copy should be available on all of
the HDD type, while flash storage growth is a bit slower but still
projected to be up to 10T within the time we care about now.

The small devices are for usecases with not necessarily a physical
device, eg. emulated or on top of device mapper targets.

The superblock copy itself is not a frequently utilized feature but we
still want to keep it, for parity with non-zoned mode and "just in
case".

I'll update the patch changelog and code to reflect the changes and
resend to all people involved. The pull request is scheduled for
tomorrow.
diff mbox series

Patch

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1f972b75a9ab..a4b195fe08a0 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -21,9 +21,28 @@ 
 /* Pseudo write pointer value for conventional zone */
 #define WP_CONVENTIONAL ((u64)-2)
 
+/*
+ * Location of the first zone of superblock logging zone pairs.
+ * - Primary superblock: the zone containing offset 0 (zone 0)
+ * - First superblock copy: the zone containing offset 64G
+ * - Second superblock copy: the zone containing offset 256G
+ */
+#define BTRFS_PRIMARY_SB_LOG_ZONE 0ULL
+#define BTRFS_FIRST_SB_LOG_ZONE (64ULL * SZ_1G)
+#define BTRFS_SECOND_SB_LOG_ZONE (256ULL * SZ_1G)
+#define BTRFS_FIRST_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_FIRST_SB_LOG_ZONE)
+#define BTRFS_SECOND_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_SECOND_SB_LOG_ZONE)
+
 /* Number of superblock log zones */
 #define BTRFS_NR_SB_LOG_ZONES 2
 
+/*
+ * Maximum size of zones. Currently, SMR disks have a zone size of 256MB,
+ * and we are expecting ZNS drives to be in the 1-4GB range. We do not
+ * expect the zone size to become larger than 8GB in the near future.
+ */
+#define BTRFS_MAX_ZONE_SIZE SZ_8G
+
 static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void *data)
 {
 	struct blk_zone *zones = data;
@@ -111,11 +130,8 @@  static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
 }
 
 /*
- * The following zones are reserved as the circular buffer on ZONED btrfs.
- *  - The primary superblock: zones 0 and 1
- *  - The first copy: zones 16 and 17
- *  - The second copy: zones 1024 or zone at 256GB which is minimum, and
- *                     the following one
+ * Get the zone number of the first zone of a pair of contiguous zones used
+ * for superblock logging.
  */
 static inline u32 sb_zone_number(int shift, int mirror)
 {
@@ -123,8 +139,8 @@  static inline u32 sb_zone_number(int shift, int mirror)
 
 	switch (mirror) {
 	case 0: return 0;
-	case 1: return 16;
-	case 2: return min_t(u64, btrfs_sb_offset(mirror) >> shift, 1024);
+	case 1: return 1 << (BTRFS_FIRST_SB_LOG_ZONE_SHIFT - shift);
+	case 2: return 1 << (BTRFS_SECOND_SB_LOG_ZONE_SHIFT - shift);
 	}
 
 	return 0;
@@ -300,10 +316,21 @@  int btrfs_get_dev_zone_info(struct btrfs_device *device)
 		zone_sectors = bdev_zone_sectors(bdev);
 	}
 
-	nr_sectors = bdev_nr_sectors(bdev);
 	/* Check if it's power of 2 (see is_power_of_2) */
 	ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
 	zone_info->zone_size = zone_sectors << SECTOR_SHIFT;
+
+	/* We reject devices with a zone size larger than 8GB. */
+	if (zone_info->zone_size > BTRFS_MAX_ZONE_SIZE) {
+		btrfs_err_in_rcu(fs_info,
+				 "zoned: %s: zone size %llu is too large",
+				 rcu_str_deref(device->name),
+				 zone_info->zone_size);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	nr_sectors = bdev_nr_sectors(bdev);
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
 	zone_info->max_zone_append_size =
 		(u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;