[v2,2/3] blkid: add magic and probing for zoned btrfs

Message ID 20210414013339.2936229-3-naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series implement zone-aware probing/wiping for zoned btrfs

Commit Message

Naohiro Aota April 14, 2021, 1:33 a.m. UTC
This commit adds zone-aware magics and probing functions for zoned btrfs.

The superblock (and its copies) is the only data structure in btrfs with a
fixed location on a device. Since we cannot overwrite data in a sequential
write required zone, we cannot keep the superblock at a fixed location in
such a zone.

Thus, zoned btrfs uses superblock log writing to update superblocks on
sequential write required zones. It uses two zones as a circular buffer to
write updated superblocks. Once the first zone is filled up, it starts
writing into the second zone. When both zones are filled up, it resets the
first zone before starting to write to it again.

We can determine the position of the latest superblock by reading the write
pointer information from a device. One corner case is when both zones are
full. In this situation, we read out the last superblock of each zone and
compare their generations to determine which zone is older.
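
To make the write-pointer rule concrete, here is a minimal sketch of the
decision, following the branch order of sb_write_pointer() in the patch
below. The enum names and the function sb_log_position() are hypothetical
stand-ins for illustration only; the patch itself tests BLK_ZONE_COND_EMPTY
and BLK_ZONE_COND_FULL on struct blk_zone.

```c
#include <assert.h>

/* Hypothetical stand-ins for the zone conditions reported by the device. */
enum zone_state { ZONE_EMPTY, ZONE_IN_USE, ZONE_FULL };

/* Where to look for the latest superblock (illustrative names). */
enum log_pos {
	LOG_NONE,	/* both zones empty: no superblock written yet */
	LOG_WP_ZONE0,	/* use the write pointer of zone #0 */
	LOG_WP_ZONE1,	/* use the write pointer of zone #1 */
	LOG_COMPARE,	/* both zones full: compare generations */
	LOG_INVALID,	/* state the log writer should never produce */
};

static enum log_pos sb_log_position(enum zone_state z0, enum zone_state z1)
{
	if (z0 == ZONE_EMPTY && z1 == ZONE_EMPTY)
		return LOG_NONE;
	if (z0 == ZONE_FULL && z1 == ZONE_FULL)
		return LOG_COMPARE;
	/* zone #0 still has room, zone #1 is untouched or already filled */
	if (z0 != ZONE_FULL && (z1 == ZONE_EMPTY || z1 == ZONE_FULL))
		return LOG_WP_ZONE0;
	/* zone #0 is filled, so the log continues in zone #1 */
	if (z0 == ZONE_FULL)
		return LOG_WP_ZONE1;
	return LOG_INVALID;
}
```

Zone #1 being in use while zone #0 is not full falls through to the invalid
case, since the log only moves to zone #1 after zone #0 has been filled.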

The magics can detect a superblock magic ("_BHRfS_M") at the beginning of
zone #0 or zone #1 to see if it is zoned btrfs. When both zones are filled
up, zoned btrfs resets the first zone to write a new superblock. If btrfs
crashes at that moment, we do not see a superblock at zone #0. Thus, we need
to check not only zone #0 but also zone #1.
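
As a rough illustration of where these magics would be looked up, the sketch
below computes the byte offset of a zoned magic from the zone size. It
assumes that .zonenum selects the zone, .kboff_inzone is a KiB offset inside
that zone, and .sboff is a byte offset from there. The struct and function
names are hypothetical and are not the libblkid API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of a magic description, named after the patch's fields. */
struct zoned_magic {
	uint32_t zonenum;	/* which zone holds the magic */
	uint32_t kboff_inzone;	/* offset inside the zone, in KiB */
	uint32_t sboff;		/* byte offset of the magic string */
};

/* Byte offset on the device where the magic string is expected. */
static uint64_t zoned_magic_offset(const struct zoned_magic *m,
				   uint64_t zone_size)
{
	return (uint64_t)m->zonenum * zone_size +
	       ((uint64_t)m->kboff_inzone << 10) + m->sboff;
}
```

Under these assumptions, with a 256 MiB zone size the magic for zone #1
would be expected at byte 268435456 + 0x40.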

It also supports the temporary magic ("!BHRfS_M") in zone #0. mkfs.btrfs
first writes the temporary superblock to the zone during the mkfs process.
It will survive there until the zones are filled up and reset. So, we also
need to detect this temporary magic.

Finally, this commit extends probe_btrfs() to load the latest superblock
determined by the write pointers.
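
For illustration, the offset arithmetic can be sketched as follows. This is
a simplified stand-alone version of what sb_log_offset() does once a write
pointer has been chosen; the function name latest_sb_bytenr() is
hypothetical, and the case where both zones are empty (no superblock
written yet) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
#define BTRFS_SUPER_INFO_SIZE 4096

/*
 * All arguments are in 512-byte sectors, as reported for a zone.
 * wp is the write pointer selected from the two log zones.
 */
static uint64_t latest_sb_bytenr(uint64_t zone0_start, uint64_t zone1_start,
				 uint64_t zone1_len, uint64_t wp)
{
	uint64_t bytenr = wp << SECTOR_SHIFT;

	/* wp at the start of zone #0: the latest copy ends zone #1 */
	if (wp == zone0_start)
		bytenr = (zone1_start + zone1_len) << SECTOR_SHIFT;

	/* the latest superblock is the one just before the write pointer */
	return bytenr - BTRFS_SUPER_INFO_SIZE;
}
```

The wrap-around branch reflects the circular buffer: a write pointer sitting
at the very start of zone #0 means the most recent superblock was the last
one written to zone #1.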

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 libblkid/src/superblocks/btrfs.c | 185 ++++++++++++++++++++++++++++++-
 1 file changed, 184 insertions(+), 1 deletion(-)

Comments

Johannes Thumshirn April 14, 2021, 9:49 a.m. UTC | #1
On 14/04/2021 03:33, Naohiro Aota wrote:
> This commit adds zone-aware magics and probing functions for zoned btrfs.
> 
> Superblock (and its copies) is the only data structure in btrfs with a

The superblock?

> fixed location on a device. Since we cannot overwrite in a sequential write

cannot do overwrites

> required zone, we cannot place superblock in the zone.

the superblock

> 
> Thus, zoned btrfs use superblock log writing to update superblock on

Thus, zoned btrfs uses superblock log writing to update superblocks on

> sequential write required zones. It uses two zones as a circular buffer to
> write updated superblocks. Once the first zone is filled up, start writing
> into the second buffer. When both zones are filled up and before start

starting to write

> writing to the first zone again, it reset the first zone.
> 
> We can determine the position of the latest superblock by reading write

reading the write pointer

> pointer information from a device. One corner case is when both zones are
> full. For this situation, we read out the last superblock of each zone and
> compare them to determine which zone is older.
> 
> The magics can detect a superblock magic ("_BHRfs_M") at the beginning of
> zone #0 or zone #1 to see if it is zoned btrfs. When both zones are filled
> up, zoned btrfs reset the first zone to write a new superblock. If btrfs


resets

> crash at the moment, we do not see a superblock at zone #0. Thus, we need

crashes

> to check not only zone #0 but also zone #1.
> 
> It also supports temporary magic ("!BHRfS_M") in zone #0. The mkfs.btrfs

the temporary magic [...]. Mkfs.btrfs

[...]

> +	 * Log position:
> +	 *   *: Special case, no superblock is written
> +	 *   0: Use write pointer of zones[0]
> +	 *   1: Use write pointer of zones[1]
> +	 *   C: Compare super blcoks from zones[0] and zones[1], use the latest
                        blocks ~^

[...]

> +	rep_size = sizeof(struct blk_zone_report) + sizeof(struct blk_zone) * 2;
> +	rep = malloc(rep_size);
> +	if (!rep)
> +		return -errno;
> +
> +	memset(rep, 0, rep_size);

I think Damien already pointed this out, but that's a good opportunity for calloc().

Otherwise,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Karel Zak April 14, 2021, 1:47 p.m. UTC | #2
On Wed, Apr 14, 2021 at 10:33:38AM +0900, Naohiro Aota wrote:
> +#define ASSERT(x) assert(x)

Really? ;-)

> +typedef uint64_t u64;
> +typedef uint64_t sector_t;
> +typedef uint8_t u8;

I do not see a reason for u64 and u8 here.

> +
> +#ifdef HAVE_LINUX_BLKZONED_H
> +static int sb_write_pointer(int fd, struct blk_zone *zones, u64 *wp_ret)
> +{
> +	bool empty[BTRFS_NR_SB_LOG_ZONES];
> +	bool full[BTRFS_NR_SB_LOG_ZONES];
> +	sector_t sector;
> +
> +	ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL &&
> +	       zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL);

assert()

 ...
> +		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
> +			u64 bytenr;
> +
> +			bytenr = ((zones[i].start + zones[i].len)
> +				   << SECTOR_SHIFT) - BTRFS_SUPER_INFO_SIZE;
> +
> +			ret = pread64(fd, buf[i], BTRFS_SUPER_INFO_SIZE,
> +				      bytenr);

 please, use

     ptr = blkid_probe_get_buffer(pr, bytenr, BTRFS_SUPER_INFO_SIZE);

 the library will take care of the buffer and reuse it. It's also
 important to keep blkid_do_wipe() usable.

> +			if (ret != BTRFS_SUPER_INFO_SIZE)
> +				return -EIO;
> +			super[i] = (struct btrfs_super_block *)&buf[i];

  super[i] = (struct btrfs_super_block *) ptr;

> +		}
> +
> +		if (super[0]->generation > super[1]->generation)
> +			sector = zones[1].start;
> +		else
> +			sector = zones[0].start;
> +	} else if (!full[0] && (empty[1] || full[1])) {
> +		sector = zones[0].wp;
> +	} else if (full[0]) {
> +		sector = zones[1].wp;
> +	} else {
> +		return -EUCLEAN;
> +	}
> +	*wp_ret = sector << SECTOR_SHIFT;
> +	return 0;
> +}
> +
> +static int sb_log_offset(blkid_probe pr, uint64_t *bytenr_ret)
> +{
> +	uint32_t zone_num = 0;
> +	uint32_t zone_size_sector;
> +	struct blk_zone_report *rep;
> +	struct blk_zone *zones;
> +	size_t rep_size;
> +	int ret;
> +	uint64_t wp;
> +
> +	zone_size_sector = pr->zone_size >> SECTOR_SHIFT;
> +
> +	rep_size = sizeof(struct blk_zone_report) + sizeof(struct blk_zone) * 2;
> +	rep = malloc(rep_size);
> +	if (!rep)
> +		return -errno;
> +
> +	memset(rep, 0, rep_size);
> +	rep->sector = zone_num * zone_size_sector;
> +	rep->nr_zones = 2;

what about adding a new function to lib/blkdev.c:

   struct blk_zone_report *blkdev_get_zonereport(int fd, uint64_t sector, int nzones);

and call this function from your sb_log_offset() as well as from blkid_do_wipe()?

Anyway, calloc() is better than malloc()+memset().
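
A rough sketch of what such a helper might look like, combining the
calloc() suggestion with the BLKREPORTZONE ioctl already used in the patch.
The error handling (returning NULL and leaving errno set) and the use of
uint32_t for nzones are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

/*
 * Returns a calloc()ed report holding up to nzones zones starting at
 * sector, or NULL on failure. The caller is expected to free() it.
 */
static struct blk_zone_report *blkdev_get_zonereport(int fd, uint64_t sector,
						      uint32_t nzones)
{
	struct blk_zone_report *rep;

	/* calloc() zeroes the header and the zone array in one step */
	rep = calloc(1, sizeof(*rep) + sizeof(struct blk_zone) * nzones);
	if (!rep)
		return NULL;

	rep->sector = sector;
	rep->nr_zones = nzones;

	if (ioctl(fd, BLKREPORTZONE, rep) != 0) {
		free(rep);
		return NULL;
	}
	return rep;
}
```

A caller such as sb_log_offset() would then check rep->nr_zones and read the
zones from rep->zones before freeing the report.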

> +	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> +		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
> +		ret = 0;
> +		goto out;
> +	} else if (zones[1].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> +		*bytenr_ret = zones[1].start << SECTOR_SHIFT;
> +		ret = 0;
> +		goto out;
> +	}

what about:

 for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
   if (zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
      *bytenr_ret = zones[i].start << SECTOR_SHIFT;
      ret = 0;
      goto out;
   }
 }




 Karel
Naohiro Aota April 14, 2021, 3:08 p.m. UTC | #3
On Wed, Apr 14, 2021 at 03:47:08PM +0200, Karel Zak wrote:
> On Wed, Apr 14, 2021 at 10:33:38AM +0900, Naohiro Aota wrote:
> > +#define ASSERT(x) assert(x)
> 
> Really? ;-)
> 
> > +typedef uint64_t u64;
> > +typedef uint64_t sector_t;
> > +typedef uint8_t u8;
> 
> I do not see a reason for u64 and u8 here.

Yep, these are here just to make it easy to copy the code from the
kernel. But this code won't change much, so I can drop these.

> > +
> > +#ifdef HAVE_LINUX_BLKZONED_H
> > +static int sb_write_pointer(int fd, struct blk_zone *zones, u64 *wp_ret)
> > +{
> > +	bool empty[BTRFS_NR_SB_LOG_ZONES];
> > +	bool full[BTRFS_NR_SB_LOG_ZONES];
> > +	sector_t sector;
> > +
> > +	ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL &&
> > +	       zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL);
> 
> assert()

I will use it.

>  ...
> > +		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
> > +			u64 bytenr;
> > +
> > +			bytenr = ((zones[i].start + zones[i].len)
> > +				   << SECTOR_SHIFT) - BTRFS_SUPER_INFO_SIZE;
> > +
> > +			ret = pread64(fd, buf[i], BTRFS_SUPER_INFO_SIZE,
> > +				      bytenr);
> 
>  please, use  
> 
>      ptr = blkid_probe_get_buffer(pr, bytenr, BTRFS_SUPER_INFO_SIZE);
> 
>  the library will care about the buffer and reuse it. It's also
>  important to keep blkid_do_wipe() usable.

Sure. I'll use it.

> > +			if (ret != BTRFS_SUPER_INFO_SIZE)
> > +				return -EIO;
> > +			super[i] = (struct btrfs_super_block *)&buf[i];
> 
>   super[i] = (struct btrfs_super_block *) ptr;
> 
> > +		}
> > +
> > +		if (super[0]->generation > super[1]->generation)
> > +			sector = zones[1].start;
> > +		else
> > +			sector = zones[0].start;
> > +	} else if (!full[0] && (empty[1] || full[1])) {
> > +		sector = zones[0].wp;
> > +	} else if (full[0]) {
> > +		sector = zones[1].wp;
> > +	} else {
> > +		return -EUCLEAN;
> > +	}
> > +	*wp_ret = sector << SECTOR_SHIFT;
> > +	return 0;
> > +}
> > +
> > +static int sb_log_offset(blkid_probe pr, uint64_t *bytenr_ret)
> > +{
> > +	uint32_t zone_num = 0;
> > +	uint32_t zone_size_sector;
> > +	struct blk_zone_report *rep;
> > +	struct blk_zone *zones;
> > +	size_t rep_size;
> > +	int ret;
> > +	uint64_t wp;
> > +
> > +	zone_size_sector = pr->zone_size >> SECTOR_SHIFT;
> > +
> > +	rep_size = sizeof(struct blk_zone_report) + sizeof(struct blk_zone) * 2;
> > +	rep = malloc(rep_size);
> > +	if (!rep)
> > +		return -errno;
> > +
> > +	memset(rep, 0, rep_size);
> > +	rep->sector = zone_num * zone_size_sector;
> > +	rep->nr_zones = 2;
> 
> what about to add to lib/blkdev.c a new function:
> 
>    struct blk_zone_report *blkdev_get_zonereport(int fd, uint64_t sector, int nzones);
> 
> and call this function from your sb_log_offset() as well as from blkid_do_wipe()?
> 
> Anyway, calloc() is better than malloc()+memset().

Indeed. I will do so.

> > +	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> > +		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
> > +		ret = 0;
> > +		goto out;
> > +	} else if (zones[1].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> > +		*bytenr_ret = zones[1].start << SECTOR_SHIFT;
> > +		ret = 0;
> > +		goto out;
> > +	}
> 
> what about:
> 
>  for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
>    if (zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>       *bytenr_ret = zones[i].start << SECTOR_SHIFT;
>       ret = 0;
>       goto out;
>    }
>  }

Yes, this looks cleaner. Thanks.

> 
> 
> 
>  Karel
> 
> -- 
>  Karel Zak  <kzak@redhat.com>
>  http://karelzak.blogspot.com
>
David Sterba April 16, 2021, 3:52 p.m. UTC | #4
On Wed, Apr 14, 2021 at 10:33:38AM +0900, Naohiro Aota wrote:
> It also supports temporary magic ("!BHRfS_M") in zone #0. The mkfs.btrfs
> first writes the temporary superblock to the zone during the mkfs process.
> It will survive there until the zones are filled up and reset. So, we also
> need to detect this temporary magic.

> +	  /*
> +	   * For zoned btrfs, we also need to detect a temporary superblock
> +	   * at zone #0. Mkfs.btrfs creates it in the initialize process.
> +	   * It persits until both zones are filled up then reset.
> +	   */
> +	  { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
> +	    .is_zoned = 1, .zonenum = 0, .kboff_inzone = 0 },

Should we rather reset the zone twice so the initial superblock does not
have the temporary signature?

For the primary superblock at offset 64K and as a fallback the signature
should be valid for detection purposes (ie. not necessarily to get the
latest superblock).
Naohiro Aota April 26, 2021, 1:38 a.m. UTC | #5
On Fri, Apr 16, 2021 at 05:52:41PM +0200, David Sterba wrote:
> On Wed, Apr 14, 2021 at 10:33:38AM +0900, Naohiro Aota wrote:
> > It also supports temporary magic ("!BHRfS_M") in zone #0. The mkfs.btrfs
> > first writes the temporary superblock to the zone during the mkfs process.
> > It will survive there until the zones are filled up and reset. So, we also
> > need to detect this temporary magic.
> 
> > +	  /*
> > +	   * For zoned btrfs, we also need to detect a temporary superblock
> > +	   * at zone #0. Mkfs.btrfs creates it in the initialize process.
> > +	   * It persits until both zones are filled up then reset.
> > +	   */
> > +	  { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
> > +	    .is_zoned = 1, .zonenum = 0, .kboff_inzone = 0 },
> 
> Should we rather reset the zone twice so the initial superblock does not
> have the temporary signature?

OK, sure. I'll modify mkfs.btrfs to reset the superblock log zone
before writing out the final superblock.

> For the primary superblock at offset 64K and as a fallback the signature
> should be valid for detection purposes (ie. not necessarily to get the
> latest superblock).

Patch

diff --git a/libblkid/src/superblocks/btrfs.c b/libblkid/src/superblocks/btrfs.c
index f0fde700d896..812918ac1f42 100644
--- a/libblkid/src/superblocks/btrfs.c
+++ b/libblkid/src/superblocks/btrfs.c
@@ -9,6 +9,12 @@ 
 #include <unistd.h>
 #include <string.h>
 #include <stdint.h>
+#include <stdbool.h>
+#include <assert.h>
+
+#ifdef HAVE_LINUX_BLKZONED_H
+#include <linux/blkzoned.h>
+#endif
 
 #include "superblocks.h"
 
@@ -59,11 +65,176 @@  struct btrfs_super_block {
 	uint8_t label[256];
 } __attribute__ ((__packed__));
 
+#define BTRFS_SUPER_INFO_SIZE 4096
+
+/* Number of superblock log zones */
+#define BTRFS_NR_SB_LOG_ZONES 2
+
+/* Introduce some macros and types to unify the code with kernel side */
+#define SECTOR_SHIFT 9
+
+#define ASSERT(x) assert(x)
+
+typedef uint64_t u64;
+typedef uint64_t sector_t;
+typedef uint8_t u8;
+
+#ifdef HAVE_LINUX_BLKZONED_H
+static int sb_write_pointer(int fd, struct blk_zone *zones, u64 *wp_ret)
+{
+	bool empty[BTRFS_NR_SB_LOG_ZONES];
+	bool full[BTRFS_NR_SB_LOG_ZONES];
+	sector_t sector;
+
+	ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL &&
+	       zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL);
+
+	empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY;
+	empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY;
+	full[0] = zones[0].cond == BLK_ZONE_COND_FULL;
+	full[1] = zones[1].cond == BLK_ZONE_COND_FULL;
+
+	/*
+	 * Possible states of log buffer zones
+	 *
+	 *           Empty[1]  In use[1]  Full[1]
+	 * Empty[0]         *          x        0
+	 * In use[0]        0          x        0
+	 * Full[0]          1          1        C
+	 *
+	 * Log position:
+	 *   *: Special case, no superblock is written
+	 *   0: Use write pointer of zones[0]
+	 *   1: Use write pointer of zones[1]
+	 *   C: Compare super blocks from zones[0] and zones[1], use the latest
+	 *      one determined by generation
+	 *   x: Invalid state
+	 */
+
+	if (empty[0] && empty[1]) {
+		/* Special case to distinguish no superblock to read */
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	} else if (full[0] && full[1]) {
+		/* Compare two super blocks */
+		u8 buf[BTRFS_NR_SB_LOG_ZONES][BTRFS_SUPER_INFO_SIZE];
+		struct btrfs_super_block *super[BTRFS_NR_SB_LOG_ZONES];
+		int i;
+		int ret;
+
+		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
+			u64 bytenr;
+
+			bytenr = ((zones[i].start + zones[i].len)
+				   << SECTOR_SHIFT) - BTRFS_SUPER_INFO_SIZE;
+
+			ret = pread64(fd, buf[i], BTRFS_SUPER_INFO_SIZE,
+				      bytenr);
+			if (ret != BTRFS_SUPER_INFO_SIZE)
+				return -EIO;
+			super[i] = (struct btrfs_super_block *)&buf[i];
+		}
+
+		if (super[0]->generation > super[1]->generation)
+			sector = zones[1].start;
+		else
+			sector = zones[0].start;
+	} else if (!full[0] && (empty[1] || full[1])) {
+		sector = zones[0].wp;
+	} else if (full[0]) {
+		sector = zones[1].wp;
+	} else {
+		return -EUCLEAN;
+	}
+	*wp_ret = sector << SECTOR_SHIFT;
+	return 0;
+}
+
+static int sb_log_offset(blkid_probe pr, uint64_t *bytenr_ret)
+{
+	uint32_t zone_num = 0;
+	uint32_t zone_size_sector;
+	struct blk_zone_report *rep;
+	struct blk_zone *zones;
+	size_t rep_size;
+	int ret;
+	uint64_t wp;
+
+	zone_size_sector = pr->zone_size >> SECTOR_SHIFT;
+
+	rep_size = sizeof(struct blk_zone_report) + sizeof(struct blk_zone) * 2;
+	rep = malloc(rep_size);
+	if (!rep)
+		return -errno;
+
+	memset(rep, 0, rep_size);
+	rep->sector = zone_num * zone_size_sector;
+	rep->nr_zones = 2;
+
+	ret = ioctl(pr->fd, BLKREPORTZONE, rep);
+	if (ret) {
+		ret = -errno;
+		goto out;
+	}
+	if (rep->nr_zones != 2) {
+		ret = 1;
+		goto out;
+	}
+
+	zones = (struct blk_zone *)(rep + 1);
+
+	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
+		ret = 0;
+		goto out;
+	} else if (zones[1].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		*bytenr_ret = zones[1].start << SECTOR_SHIFT;
+		ret = 0;
+		goto out;
+	}
+
+	ret = sb_write_pointer(pr->fd, zones, &wp);
+	if (ret != -ENOENT && ret) {
+		ret = 1;
+		goto out;
+	}
+	if (ret != -ENOENT) {
+		if (wp == zones[0].start << SECTOR_SHIFT)
+			wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT;
+		wp -= BTRFS_SUPER_INFO_SIZE;
+	}
+	*bytenr_ret = wp;
+
+	ret = 0;
+out:
+	free(rep);
+
+	return ret;
+}
+#endif
+
 static int probe_btrfs(blkid_probe pr, const struct blkid_idmag *mag)
 {
 	struct btrfs_super_block *bfs;
 
-	bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
+	if (pr->zone_size) {
+#ifdef HAVE_LINUX_BLKZONED_H
+		uint64_t offset = 0;
+		int ret;
+
+		ret = sb_log_offset(pr, &offset);
+		if (ret)
+			return ret;
+		bfs = (struct btrfs_super_block *)
+			blkid_probe_get_buffer(pr, offset,
+					       sizeof(struct btrfs_super_block));
+#else
+		/* Nothing can be done */
+		return 1;
+#endif
+	} else {
+		bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
+	}
 	if (!bfs)
 		return errno ? -errno : 1;
 
@@ -88,6 +259,18 @@  const struct blkid_idinfo btrfs_idinfo =
 	.magics		=
 	{
 	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40, .kboff = 64 },
+	  /* For zoned btrfs */
+	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zoned = 1, .zonenum = 0, .kboff_inzone = 0 },
+	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zoned = 1, .zonenum = 1, .kboff_inzone = 0 },
+	  /*
+	   * For zoned btrfs, we also need to detect a temporary superblock
+	   * at zone #0. Mkfs.btrfs creates it during initialization.
+	   * It persists until both zones are filled up and then reset.
+	   */
+	  { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zoned = 1, .zonenum = 0, .kboff_inzone = 0 },
 	  { NULL }
 	}
 };