diff mbox series

btrfs-dev-docs: add document for RAID stripe tree

Message ID 20230918140515.2597127-1-johannes.thumshirn@wdc.com (mailing list archive)
State New, archived
Headers show
Series btrfs-dev-docs: add document for RAID stripe tree | expand

Commit Message

Johannes Thumshirn Sept. 18, 2023, 2:05 p.m. UTC
Add a document describing the layout and functionality of the newly
introduced RAID Stripe Tree.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 raid-stripe-tree.txt | 318 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 318 insertions(+)
 create mode 100644 raid-stripe-tree.txt

Comments

David Sterba Oct. 23, 2023, 6:51 p.m. UTC | #1
On Mon, Sep 18, 2023 at 07:05:11AM -0700, Johannes Thumshirn wrote:
> Add a document describing the layout and functionality of the newly
> introduced RAID Stripe Tree.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

I've added the document to btrfs-progs, the btrfs-dev-docs repo is not
much updated nor visible. The development documentation could be next to
the user documentation on the rtfd.io site. As this is .txt I haven't
reformatted it to RST, for publishing on web this needs to be done
first. Thanks.

Patch

diff --git a/raid-stripe-tree.txt b/raid-stripe-tree.txt
new file mode 100644
index 000000000000..7cba1d2f0d75
--- /dev/null
+++ b/raid-stripe-tree.txt
@@ -0,0 +1,318 @@ 
+BTRFS RAID Stripe Tree Design
+=============================
+
+
+Problem Statement
+-----------------
+
+
+RAID on zoned devices
+---------------------
+The current implementation of RAID profiles in BTRFS is based on the implicit
+assumption that data placement is deterministic within the device chunks used
+for mapping block groups.
+With deterministic data placement, all physical on-disk extents of one logical
+file extent are positioned at the same offset relative to the starting LBA of
+a device chunk. This avoids the need to read any additional meta-data to
+access an on-disk file extent. Figure 1 shows an example of this.
+
+
+	+------------+      +------------+
+	|            |      |            |
+	|            |      |            |
+	|            |      |            |
+	+------------+      +------------+
+	|            |      |            |
+	|    D 1     |      |    D 1     |
+	|            |      |            |
+	+------------+      +------------+
+	|            |      |            |
+	|    D 2     |      |    D 2     |
+	|            |      |            |
+	+------------+      +------------+
+	|            |      |            |
+	|            |      |            |
+	|            |      |            |
+	|            |      |            |
+	+------------+      +------------+
+	Figure 1: Deterministic data placement
+
+
+
+With non-deterministic data placement, the on-disk extents of a logical file
+extent can be scattered around inside the chunk. To read back the data with
+non-deterministic data placement, additional meta-data describing the position
+of the extents inside a chunk is needed. Figure 2 shows an example for this
+style of data placement.
+
+
+	+------------+      +------------+
+	|            |      |            |
+	|            |      |            |
+	|            |      |            |
+	+------------+      +------------+
+	|            |      |            |
+	|    D 1     |      |    D 2     |
+	|            |      |            |
+	+------------+      +------------+
+	|            |      |            |
+	|    D 2     |      |    D 1     |
+	|            |      |            |
+	+------------+      +------------+
+	|            |      |            |
+	|            |      |            |
+	|            |      |            |
+	|            |      |            |
+	+------------+      +------------+
+	Figure 2: Non-deterministic data placement
+  
+As BTRFS support for zoned block devices uses the Zone Append operation for
+writing file data extents, there is no guarantee that the written extents have
+the same offset within different device chunks. This implies that to be able
+to use RAID with zoned devices, non-deterministic data placement must be
+supported and additional meta-data describing the location of file extents
+within device chunks is needed.
+
+
+Lessons learned from RAID 5
+---------------------------
+The BTRFS implementation of RAID levels 5 and 6 suffers from the well-known
+RAID write hole problem. This problem exists because sub-stripe write
+operations are not done using copy-on-write (COW) but using Read-Modify-Write
+(RMW). With out-of-place writing like COW, no blocks are overwritten, so
+there is no risk of exposing bad data or corrupting a data stripe's parity in
+case of a sudden power loss or another unexpected event preventing a correct
+write to the device.
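+The parity arithmetic behind the write hole can be sketched as follows. This
+is an illustration only, not BTRFS code: RAID 5 parity is the XOR of all data
+blocks in a stripe, and a sub-stripe RMW recomputes it incrementally.

```c
#include <stdint.h>
#include <stddef.h>

/* RAID5 parity of a full stripe: XOR of all data blocks. */
uint8_t parity_full(const uint8_t *blocks, size_t nblocks)
{
	uint8_t p = 0;

	for (size_t i = 0; i < nblocks; i++)
		p ^= blocks[i];
	return p;
}

/*
 * Sub-stripe RMW: read old data and old parity, compute
 *   new_parity = old_parity ^ old_data ^ new_data
 * and write both new_data and new_parity back in place. If the system
 * crashes after one of the two in-place writes but before the other,
 * parity and data disagree -- this is the write hole.
 */
uint8_t parity_rmw(uint8_t old_parity, uint8_t old_data, uint8_t new_data)
{
	return old_parity ^ old_data ^ new_data;
}
```

+With COW, the new data and parity instead go to unused space and become
+visible atomically with the metadata update, so no such inconsistent
+intermediate state can exist on disk.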
+
+RAID Stripe Tree Design overview
+--------------------------------
+
+To solve the problems stated above, additional meta-data is introduced: a RAID
+Stripe Tree holds the logical to physical translation for the RAID stripes.
+For each logical file extent (struct btrfs_file_extent_item) a stripe extent
+is created (struct btrfs_stripe_extent). Each btrfs_stripe_extent entry is a
+container for an array of struct btrfs_raid_stride. A struct btrfs_raid_stride
+holds the device ID and the physical start location on that device of the
+sub-stripe of a file extent, as well as the stride's length.
+Each struct btrfs_stripe_extent is keyed by the struct btrfs_file_extent_item
+disk_bytenr and disk_num_bytes, with disk_bytenr as the objectid for the
+btrfs_key and disk_num_bytes as the offset. The key’s type is
+BTRFS_STRIPE_EXTENT_KEY.
+
+On-disk format
+--------------
+
+struct btrfs_file_extent_item {
+        /* […] */
+        __le64 disk_bytenr; ------------------------------------+ 
+        __le64 disk_num_bytes;  --------------------------------|----+
+	/* […] */                                               |    |
+};                                                              |    |
+                                                                |    |
+struct btrfs_key key = {   -------------------------------------|----|--+
+	.objectid = btrfs_file_extent_item::disk_bytenr, <------+    |  |
+	.type = BTRFS_STRIPE_EXTENT_KEY,                             |  |
+	.offset = btrfs_file_extent_item::disk_num_bytes, <----------+  |
+};                                                                      |
+                                                                        |
+struct btrfs_raid_stride { <------------------+                         |
+       __le64 devid;                          |                         |
+       __le64 physical;                       |                         |
+       __le64 length;                         |                         |
+};                                            |                         |
+                                              |                         |
+struct btrfs_stripe_extent {  <---------------|-------------------------+
+	u8 encoding;                          |
+	u8 reserved[7];                       |
+       struct btrfs_raid_stride strides[]; ---+
+};
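+Given the structures above, the logical-to-physical translation for a striped
+(RAID0-style) encoding can be sketched in plain C. The struct below is a
+hypothetical in-memory mirror of struct btrfs_raid_stride, not the kernel's
+extent-buffer accessors, and offsets are taken relative to the stripe
+extent's start (the key objectid):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical in-memory mirror of struct btrfs_raid_stride. */
struct raid_stride {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/*
 * Walk the strides of one stripe extent until the byte offset
 * (relative to the extent start) is covered, then report the device
 * and physical address. This only models striped encodings, where
 * consecutive strides map consecutive logical ranges; mirrored
 * encodings repeat the whole range in every stride instead.
 * Returns 0 on success, -1 if offset lies past the last stride.
 */
int rst_map(const struct raid_stride *strides, size_t nstrides,
	    uint64_t offset, uint64_t *devid, uint64_t *physical)
{
	for (size_t i = 0; i < nstrides; i++) {
		if (offset < strides[i].length) {
			*devid = strides[i].devid;
			*physical = strides[i].physical + offset;
			return 0;
		}
		offset -= strides[i].length;
	}
	return -1;
}
```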
+
+
+User-space support
+------------------
+
+
+mkfs
+----
+Supporting the RAID Stripe Tree in user-space consists of three steps for
+mkfs. The first step is creating the root of the RAID Stripe Tree itself.
+Then mkfs must set the incompat flag, so that a kernel version without the
+appropriate support cannot mount a filesystem with a RAID Stripe Tree.
+Lastly, it must allow RAID profiles on zoned devices once the tree is
+present.
+
+
+Check
+-----
+The ‘btrfs check’ support for RAID Stripe Tree is not implemented yet, but its
+task is to read the struct btrfs_stripe_extent entries for each struct
+btrfs_file_extent_item and verify that a correct mapping between these two
+exists. If data checksum verification is requested as well, the tree must
+also be read to perform the logical to physical translation; otherwise the
+data cannot be read and the checksums cannot be verified.
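+The invariants such a check would verify can be sketched in C (a sketch with
+assumed helper names, not actual btrfs-progs code): for striped encodings the
+stride lengths of one stripe extent must sum to the key offset
+(disk_num_bytes), while for mirrored encodings every stride must cover the
+full range.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Striped encodings (e.g. RAID0): the strides partition the range. */
bool rst_check_striped(const uint64_t *lengths, size_t n,
		       uint64_t disk_num_bytes)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += lengths[i];
	return sum == disk_num_bytes;
}

/* Mirrored encodings (e.g. RAID1): every stride covers the full range. */
bool rst_check_mirrored(const uint64_t *lengths, size_t n,
			uint64_t disk_num_bytes)
{
	for (size_t i = 0; i < n; i++)
		if (lengths[i] != disk_num_bytes)
			return false;
	return n > 0;
}
```

+The example tree dumps below satisfy these invariants: a RAID0 item with key
+offset 65536 and strides of 61440 and 4096 passes the striped check, and a
+RAID1 item with two strides of 65536 passes the mirrored check.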
+
+
+Example tree dumps
+------------------
+
+Example 1: Write 128k to an empty FS
+
+RAID0
+        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
+                        encoding: RAID0
+                        stripe 0 devid 1 physical XXXXXXXXX length 65536
+                        stripe 1 devid 2 physical XXXXXXXXX length 65536
+RAID1
+        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 131072
+                        stripe 1 devid 2 physical XXXXXXXXX length 131072
+RAID10
+        item 0 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 104
+                        encoding: RAID10
+                        stripe 0 devid 1 physical XXXXXXXXX length 65536
+                        stripe 1 devid 2 physical XXXXXXXXX length 65536
+                        stripe 2 devid 3 physical XXXXXXXXX length 65536
+                        stripe 3 devid 4 physical XXXXXXXXX length 65536
+
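+The itemsize values in these dumps follow from the on-disk format above:
+8 bytes of header (encoding plus reserved) and 24 bytes (three __le64 fields)
+per raid stride. As a small sketch (hypothetical helper name):

```c
#include <stddef.h>

/*
 * Item size of a struct btrfs_stripe_extent: 1 byte encoding,
 * 7 reserved bytes, then 24 bytes (3 x __le64) per raid stride.
 */
size_t rst_itemsize(size_t nstrides)
{
	return 8 + 24 * nstrides;
}
```

+Hence itemsize 32 for one stride, 56 for two, and 104 for four, as seen in
+the dumps above.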
+Example 2: Pre-fill one 64k stripe, write 4k to the 2nd stripe, write 64k,
+then write 16k.
+
+RAID0
+        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 32
+                        encoding: RAID0
+                        stripe 0 devid 1 physical XXXXXXXXX length 65536
+        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
+                        encoding: RAID0
+                        stripe 0 devid 2 physical XXXXXXXXX length 4096
+        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                        encoding: RAID0
+                        stripe 0 devid 2 physical XXXXXXXXX length 61440
+                        stripe 1 devid 1 physical XXXXXXXXX length 4096
+        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 32
+                        encoding: RAID0
+                        stripe 0 devid 1 physical XXXXXXXXX length 4096
+RAID1
+        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 65536
+                        stripe 1 devid 2 physical XXXXXXXXX length 65536
+        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 4096
+                        stripe 1 devid 2 physical XXXXXXXXX length 4096
+        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 65536
+                        stripe 1 devid 2 physical XXXXXXXXX length 65536
+        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 4096
+                        stripe 1 devid 2 physical XXXXXXXXX length 4096
+RAID10
+        item 0 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 56
+                        encoding: RAID10
+                        stripe 0 devid 1 physical XXXXXXXXX length 65536
+                        stripe 1 devid 2 physical XXXXXXXXX length 65536
+        item 1 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                        encoding: RAID10
+                        stripe 0 devid 3 physical XXXXXXXXX length 4096
+                        stripe 1 devid 4 physical XXXXXXXXX length 4096
+        item 2 key (XXXXXX RAID_STRIPE_KEY 65536) itemoff XXXXX itemsize 104
+                        encoding: RAID10
+                        stripe 0 devid 3 physical XXXXXXXXX length 61440
+                        stripe 1 devid 4 physical XXXXXXXXX length 61440
+                        stripe 2 devid 1 physical XXXXXXXXX length 4096
+                        stripe 3 devid 2 physical XXXXXXXXX length 4096
+        item 3 key (XXXXXX RAID_STRIPE_KEY 4096) itemoff XXXXX itemsize 56
+                        encoding: RAID10
+                        stripe 0 devid 1 physical XXXXXXXXX length 4096
+                        stripe 1 devid 2 physical XXXXXXXXX length 4096
+
+Example 3: Pre-fill the stripe with 32K of data, then write 128K of data and
+then overwrite 8K in the middle.
+
+RAID0
+        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 32
+                        encoding: RAID0
+                        stripe 0 devid 1 physical XXXXXXXXX length 32768
+        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 80
+                        encoding: RAID0
+                        stripe 0 devid 1 physical XXXXXXXXX length 32768
+                        stripe 1 devid 2 physical XXXXXXXXX length 65536
+                        stripe 2 devid 1 physical XXXXXXXXX length 32768
+        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 32
+                        encoding: RAID0
+                        stripe 0 devid 1 physical XXXXXXXXX length 8192
+RAID1
+        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 32768
+                        stripe 1 devid 2 physical XXXXXXXXX length 32768
+        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 131072
+                        stripe 1 devid 2 physical XXXXXXXXX length 131072
+        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
+                        encoding: RAID1
+                        stripe 0 devid 1 physical XXXXXXXXX length 8192
+                        stripe 1 devid 2 physical XXXXXXXXX length 8192
+RAID10
+        item 0 key (XXXXXX RAID_STRIPE_KEY 32768) itemoff XXXXX itemsize 56
+                        encoding: RAID10
+                        stripe 0 devid 1 physical XXXXXXXXX length 32768
+                        stripe 1 devid 2 physical XXXXXXXXX length 32768
+        item 1 key (XXXXXX RAID_STRIPE_KEY 131072) itemoff XXXXX itemsize 152
+                        encoding: RAID10
+                        stripe 0 devid 1 physical XXXXXXXXX length 32768
+                        stripe 1 devid 2 physical XXXXXXXXX length 32768
+                        stripe 2 devid 3 physical XXXXXXXXX length 65536
+                        stripe 3 devid 4 physical XXXXXXXXX length 65536
+                        stripe 4 devid 1 physical XXXXXXXXX length 32768
+                        stripe 5 devid 2 physical XXXXXXXXX length 32768
+        item 2 key (XXXXXX RAID_STRIPE_KEY 8192) itemoff XXXXX itemsize 56
+                        encoding: RAID10
+                        stripe 0 devid 1 physical XXXXXXXXX length 8192
+                        stripe 1 devid 2 physical XXXXXXXXX length 8192
+
+
+Glossary
+--------
+
+
+RAID
+	Redundant Array of Independent Disks. This is a storage mechanism
+	where data is not stored on a single disk alone but either mirrored
+	(in the case of RAID 1) or split across multiple disks (RAID 0).
+	Other RAID levels like RAID 5 and RAID 6 stripe the data across
+	multiple disks and add parity information to enable data recovery in
+	case of a disk failure.
+
+
+LBA
+	Logical Block Address. LBAs describe the address space of a block
+	device as a linearly increasing address map. LBAs are internally
+	mapped to different physical locations by the device firmware.
+
+
+Zoned Block Device
+	Zoned Block Devices are a special kind of block device that partition
+	their LBA space into so-called zones. These zones can impose write
+	constraints on the host, e.g., allowing only sequential writes aligned
+	to a zone write pointer.
+
+
+Zone Append
+	A write operation where the start LBA of a zone is specified instead
+	of a destination LBA for the data to be written. On completion, the
+	device reports back to the host the start LBA at which the data was
+	actually written.
+
+
+Copy-on-Write
+	A write technique where the data is not overwritten, but a new version
+	of it is written out of place.
+
+
+Read-Modify-Write
+	A write technique where the data to be written is first read from the
+	block device, modified in memory, and then written back in place on
+	the block device.
+