Message ID | 20230914-raid-stripe-tree-v9-0-15d423829637@wdc.com (mailing list archive) |
---|---|
Series | btrfs: introduce RAID stripe tree |
On Thu, Sep 14, 2023 at 09:06:55AM -0700, Johannes Thumshirn wrote:
> [...]
> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

Please also turn it into a developer documentation file (in
btrfs-progs/Documentation/dev), it can follow the same structure.

> [...]
> - And possibly more, please git range-diff the versions
> - Link to v8: https://lore.kernel.org/r/20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com

v9 will be added as topic branch to for-next, I did several style
changes so please send any updates as incrementals if needed.

On Thu, Sep 14, 2023 at 08:25:34PM +0200, David Sterba wrote:
> On Thu, Sep 14, 2023 at 09:06:55AM -0700, Johannes Thumshirn wrote:
> > [...]
>
> v9 will be added as topic branch to for-next, I did several style
> changes so please send any updates as incrementals if needed.

Moved to misc-next. I'll do a minor release of btrfs-progs soon so we
get the tool support for testing.

Updates of the raid-stripe-tree are done at ordered extent write time to save
on bandwidth, while for reading we do the stripe-tree lookup at bio mapping
time, i.e. when the logical to physical translation happens for regular btrfs
RAID as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.

For an example 1M write (split into 126K segments due to zone-append)

    rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
    wrote 1048576/1048576 bytes at offset 0
    1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows (both 128k buffered writes to a ZNS drive):

RAID0 case:

    bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
    btrfs-progs v6.3
    raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
    leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
    leaf 805535744 flags 0x1(WRITTEN) backref revision 1
    checksum stored 2d2d2262
    checksum calced 2d2d2262
    fs uuid ab05cfc6-9859-404e-970d-3999b1cb5438
    chunk uuid c9470ba2-49ac-4d46-8856-438a18e6bd23
        item 0 key (1073741824 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
            encoding: RAID0
            stripe 0 devid 1 offset 805306368 length 131072
            stripe 1 devid 2 offset 536870912 length 131072
    total bytes 42949672960
    bytes used 294912
    uuid ab05cfc6-9859-404e-970d-3999b1cb5438

RAID1 case:

    bash-5.2# btrfs inspect-internal dump-tree -t raid_stripe /dev/nvme0n1
    btrfs-progs v6.3
    raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
    leaf 805535744 items 1 free space 16218 generation 8 owner RAID_STRIPE_TREE
    leaf 805535744 flags 0x1(WRITTEN) backref revision 1
    checksum stored 56199539
    checksum calced 56199539
    fs uuid 9e693a37-fbd1-4891-aed2-e7fe64605045
    chunk uuid 691874fc-1b9c-469b-bd7f-05e0e6ba88c4
        item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
            encoding: RAID1
            stripe 0 devid 1 offset 939524096 length 65536
            stripe 1 devid 2 offset 536870912 length 65536
    total bytes 42949672960
    bytes used 294912
    uuid 9e693a37-fbd1-4891-aed2-e7fe64605045

A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

The user-space part of this series can be found here:
https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com

Changes to v8:
- Changed tracepoints according to David's comments
- Mark on-disk structures as packed
- Got rid of __DECLARE_FLEX_ARRAY
- Rebase onto misc-next
- Split out helpers for new btrfs_load_block_group_zone_info RAID cases
- Constify declarations where possible
- Initialise variables before use
- Lower scope of variables
- Remove btrfs_stripe_root() helper
- Pick different BTRFS_RAID_STRIPE_KEY constant
- Reorder on-disk encoding types to match the raid_index
- And possibly more, please git range-diff the versions
- Link to v8: https://lore.kernel.org/r/20230911-raid-stripe-tree-v8-0-647676fa852c@wdc.com

Changes to v7:
- Huge rewrite

v7 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1677750131.git.johannes.thumshirn@wdc.com/

Changes to v6:
- Fix degraded RAID1 mounts
- Fix RAID0/10 mounts

v6 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1676470614.git.johannes.thumshirn@wdc.com

Changes to v5:
- Incorporated review comments from Josef and Christoph
- Rebased onto misc-next

v5 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1675853489.git.johannes.thumshirn@wdc.com

Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out a RST once per delayed_ref head
- Added support for zoned data DUP with RST

Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches

v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

---

Johannes Thumshirn (11):
      btrfs: add raid stripe tree definitions
      btrfs: read raid-stripe-tree from disk
      btrfs: add support for inserting raid stripe extents
      btrfs: delete stripe extent on extent deletion
      btrfs: lookup physical address from stripe extent
      btrfs: implement RST version of scrub
      btrfs: zoned: allow zoned RAID
      btrfs: add raid stripe tree pretty printer
      btrfs: announce presence of raid-stripe-tree in sysfs
      btrfs: add trace events for RST
      btrfs: add raid-stripe-tree to features enabled with debug

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/accessors.h            |  10 +
 fs/btrfs/bio.c                  |  25 +++
 fs/btrfs/block-rsv.c            |   6 +
 fs/btrfs/disk-io.c              |  18 ++
 fs/btrfs/extent-tree.c          |   7 +
 fs/btrfs/fs.h                   |   4 +-
 fs/btrfs/inode.c                |   8 +-
 fs/btrfs/locking.c              |   1 +
 fs/btrfs/ordered-data.c         |   1 +
 fs/btrfs/ordered-data.h         |   2 +
 fs/btrfs/print-tree.c           |  26 +++
 fs/btrfs/raid-stripe-tree.c     | 449 ++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  52 +++++
 fs/btrfs/scrub.c                |  53 +++++
 fs/btrfs/sysfs.c                |   3 +
 fs/btrfs/volumes.c              |  43 +++-
 fs/btrfs/volumes.h              |  16 +-
 fs/btrfs/zoned.c                | 144 ++++++++++++-
 include/trace/events/btrfs.h    |  75 +++++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  31 +++
 22 files changed, 954 insertions(+), 23 deletions(-)

---

base-commit: 1d73023d96965a5c8fb76a39aec88d840ebe5b21
change-id: 20230613-raid-stripe-tree-e330c9a45cc3

Best regards,
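
To make the keying scheme from the cover letter concrete (stripe extents keyed by an extent's disk_bytenr and disk_num_bytes, holding a device id and physical position per stripe), here is a minimal Python sketch. It only parses the item lines of the RAID1 dump-tree output quoted above and finds the stripe extent covering a logical address. The names `Stripe`, `StripeExtent`, `parse_dump` and `lookup` are hypothetical and not part of the kernel or btrfs-progs, and the sketch deliberately does not model how the kernel turns the match into a physical address for each RAID profile.

```python
import re
from dataclasses import dataclass, field

# Patterns for the two line shapes seen in the dump-tree output quoted in the
# cover letter. This parser is an illustration, not part of btrfs-progs.
ITEM_RE = re.compile(r"item \d+ key \((\d+) RAID_STRIPE_KEY (\d+)\)")
STRIPE_RE = re.compile(r"stripe (\d+) devid (\d+) offset (\d+) length (\d+)")


@dataclass
class Stripe:
    index: int
    devid: int
    offset: int  # physical byte offset on the device, as printed by dump-tree
    length: int


@dataclass
class StripeExtent:
    bytenr: int     # first key field: the extent's disk_bytenr
    num_bytes: int  # last key field: the extent's disk_num_bytes
    stripes: list = field(default_factory=list)


def parse_dump(text):
    """Collect stripe extents from dump-tree text (hypothetical helper)."""
    extents = []
    for line in text.splitlines():
        item = ITEM_RE.search(line)
        if item:
            extents.append(StripeExtent(int(item.group(1)), int(item.group(2))))
            continue
        stripe = STRIPE_RE.search(line)
        if stripe and extents:
            # Stripe lines belong to the most recently seen item.
            extents[-1].stripes.append(Stripe(*map(int, stripe.groups())))
    return extents


def lookup(extents, logical):
    """Return (extent, offset into extent) covering a logical address, or None."""
    for ext in extents:
        if ext.bytenr <= logical < ext.bytenr + ext.num_bytes:
            return ext, logical - ext.bytenr
    return None


RAID1_DUMP = """\
item 0 key (939524096 RAID_STRIPE_KEY 131072) itemoff 16243 itemsize 56
    encoding: RAID1
    stripe 0 devid 1 offset 939524096 length 65536
    stripe 1 devid 2 offset 536870912 length 65536
"""

extents = parse_dump(RAID1_DUMP)
ext, delta = lookup(extents, 939524096 + 4096)
print(ext.bytenr, delta, [(s.devid, s.offset) for s in ext.stripes])
```

A linear scan is enough for a single item; the on-disk tree is a B-tree ordered by key, so a real lookup would be a key search rather than a loop.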