[v7,00/13] btrfs: introduce RAID stripe tree

Message ID cover.1677750131.git.johannes.thumshirn@wdc.com (mailing list archive)

Johannes Thumshirn March 2, 2023, 9:45 a.m. UTC
Updates of the raid-stripe-tree are done at delayed-ref time to save on
bandwidth, while for reading we do the stripe-tree lookup at bio mapping time,
i.e. when the logical to physical translation happens for regular btrfs RAID
as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
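
To illustrate the keying described above, here is a minimal Python sketch of
such a lookup. It is not the kernel implementation: the names StripeTree,
Stripe, insert and lookup are hypothetical, and it assumes a mirrored
(RAID1-style) profile where every stripe holds a full copy of the extent, so
the same in-extent offset applies on each device. The sample values are taken
from item 0 of the dump-tree output in this message.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass


@dataclass(frozen=True)
class Stripe:
    devid: int      # device id the copy lives on
    physical: int   # byte offset of the copy on that device


class StripeTree:
    """Toy in-memory stand-in for the on-disk tree (hypothetical names)."""

    def __init__(self):
        self._starts = []   # sorted extent start addresses (disk_bytenr)
        self._items = {}    # disk_bytenr -> (disk_num_bytes, [Stripe, ...])

    def insert(self, disk_bytenr, disk_num_bytes, stripes):
        insort(self._starts, disk_bytenr)
        self._items[disk_bytenr] = (disk_num_bytes, list(stripes))

    def lookup(self, logical):
        """Map a logical address to [(devid, physical), ...]."""
        # Find the extent with the largest start that is <= logical.
        i = bisect_right(self._starts, logical) - 1
        if i < 0:
            raise KeyError(logical)
        start = self._starts[i]
        length, stripes = self._items[start]
        if logical >= start + length:
            raise KeyError(logical)
        delta = logical - start
        # Assumption: mirrored profile, same offset into every copy.
        return [(s.devid, s.physical + delta) for s in stripes]


tree = StripeTree()
tree.insert(939524096, 126976, [Stripe(1, 939524096), Stripe(2, 536870912)])
print(tree.lookup(939524096 + 4096))
# prints [(1, 939528192), (2, 536875008)]
```

A sorted list with bisect is used here only to mimic an ordered key search; the
real tree is a btrfs b-tree searched by key.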

For an example 1M write (split into 126976-byte, i.e. 124K, segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows:

rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
btrfs-progs v5.16.1 
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
leaf 805847040 flags 0x1(WRITTEN) backref revision 1
checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
        item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
                        stripe 0 devid 1 offset 939524096
                        stripe 1 devid 2 offset 536870912
        item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
                        stripe 0 devid 1 offset 939651072
                        stripe 1 devid 2 offset 536997888
        item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
                        stripe 0 devid 1 offset 939778048
                        stripe 1 devid 2 offset 537124864
        item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
                        stripe 0 devid 1 offset 939905024
                        stripe 1 devid 2 offset 537251840
        item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
                        stripe 0 devid 1 offset 940032000
                        stripe 1 devid 2 offset 537378816
        item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
                        stripe 0 devid 1 offset 940158976
                        stripe 1 devid 2 offset 537505792
        item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
                        stripe 0 devid 1 offset 940285952
                        stripe 1 devid 2 offset 537632768
        item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
                        stripe 0 devid 1 offset 940412928
                        stripe 1 devid 2 offset 537759744
        item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
                        stripe 0 devid 1 offset 940539904
                        stripe 1 devid 2 offset 537886720
total bytes 26843545600
bytes used 1245184
uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
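
The numbers in the dump can be cross-checked with a few lines of Python. This
is only a sanity-check sketch: the helper names are made up for illustration,
and the 16-byte stripe entry (two little-endian 64-bit fields) is an
assumption inferred from itemsize 32 with two stripes, not something stated in
this cover letter.

```python
import struct

WRITE_SIZE = 1 << 20        # the 1 MiB pwrite from the example
MAX_SEGMENT = 126976        # per-item length seen in the dump (124 KiB)
FIRST_LOGICAL = 939524096   # key offset of item 0 in the dump


def split_write(total, max_segment):
    """Split `total` bytes into chunks of at most `max_segment` bytes."""
    sizes = []
    while total > 0:
        n = min(total, max_segment)
        sizes.append(n)
        total -= n
    return sizes


def stripe_keys(first_logical, sizes):
    """Build (logical start, length) pairs for back-to-back segments."""
    keys = []
    logical = first_logical
    for size in sizes:
        keys.append((logical, size))
        logical += size
    return keys


sizes = split_write(WRITE_SIZE, MAX_SEGMENT)
keys = stripe_keys(FIRST_LOGICAL, sizes)

# Assumption: one stripe entry is a (devid, physical) pair of u64 values.
STRIPE_ENTRY = struct.calcsize("<QQ")

print(len(keys))          # prints 9, matching "items 9"
print(keys[-1])           # prints (940539904, 32768), matching item 8
print(2 * STRIPE_ENTRY)   # prints 32, matching "itemsize 32"
```

Eight full 126976-byte segments cover 1015808 bytes, so the ninth item carries
the remaining 32768 bytes of the 1048576-byte write.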

A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

The user-space part of this series can be found here:
https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com

Changes to v6:
- Fix degraded RAID1 mounts
- Fix RAID0/10 mounts

v6 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1676470614.git.johannes.thumshirn@wdc.com

Changes to v5:
- Incorporated review comments from Josef and Christoph
- Rebased onto misc-next

v5 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1675853489.git.johannes.thumshirn@wdc.com

Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing 
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out an RST once per delayed_ref head
- Added support for zoned data DUP with RST

Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches

v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Johannes Thumshirn (13):
  btrfs: re-add trans parameter to insert_delayed_ref
  btrfs: add raid stripe tree definitions
  btrfs: read raid-stripe-tree from disk
  btrfs: add support for inserting raid stripe extents
  btrfs: delete stripe extent on extent deletion
  btrfs: lookup physical address from stripe extent
  btrfs: add raid stripe tree pretty printer
  btrfs: zoned: allow zoned RAID
  btrfs: check for leaks of ordered stripes on umount
  btrfs: add tracepoints for ordered stripes
  btrfs: announce presence of raid-stripe-tree in sysfs
  btrfs: consult raid-stripe-tree when scrubbing
  btrfs: add raid-stripe-tree to features enabled with debug

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/accessors.h            |  29 +++
 fs/btrfs/bio.c                  |  29 +++
 fs/btrfs/block-rsv.c            |   1 +
 fs/btrfs/delayed-ref.c          |  13 +-
 fs/btrfs/delayed-ref.h          |   2 +
 fs/btrfs/disk-io.c              |  24 ++
 fs/btrfs/disk-io.h              |   5 +
 fs/btrfs/extent-tree.c          |  68 ++++++
 fs/btrfs/fs.h                   |   7 +-
 fs/btrfs/inode.c                |  15 +-
 fs/btrfs/print-tree.c           |  21 ++
 fs/btrfs/raid-stripe-tree.c     | 416 ++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  87 +++++++
 fs/btrfs/scrub.c                |  33 ++-
 fs/btrfs/super.c                |   1 +
 fs/btrfs/sysfs.c                |   3 +
 fs/btrfs/volumes.c              |  46 +++-
 fs/btrfs/volumes.h              |  13 +-
 fs/btrfs/zoned.c                | 119 ++++++++-
 include/trace/events/btrfs.h    |  50 ++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  20 +-
 23 files changed, 973 insertions(+), 32 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

Comments

Neal Gompa March 2, 2023, 7:38 p.m. UTC | #1
On Thu, Mar 2, 2023 at 4:56 AM Johannes Thumshirn
<johannes.thumshirn@wdc.com> wrote:
> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
>
> The user-space part of this series can be found here:
> https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com
>

Apologies if this is a stupid question, but after reading through the
patch series and the design document, it sounds like the crux of this
change is switching how RAID works to be COW like everything else.
Does that also mean RAID 56 modes benefit from this in that manner?



--
真実はいつも一つ!/ Always, there's only one truth!
Johannes Thumshirn March 3, 2023, 8:45 a.m. UTC | #2
On 02.03.23 20:39, Neal Gompa wrote:
>> A design document can be found here:
>> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
>>
>> The user-space part of this series can be found here:
>> https://lore.kernel.org/linux-btrfs/20230215143109.2721722-1-johannes.thumshirn@wdc.com
>>
> 
> Apologies if this is a stupid question, but after reading through the
> patch series and the design document, it sounds like the crux of this
> change is switching how RAID works to be COW like everything else.
> Does that also mean RAID 56 modes benefit from this in that manner?
> 

Yep, that is the intention once I get far enough to have RAID56 covered.

But this is going to be the next milestone after having RAID0/1/10 done
and working properly for zoned.
Anand Jain March 3, 2023, 9:29 a.m. UTC | #3
Is there a plan to rebase this series to the latest misc-next branch?
Unfortunately, applying this series fails at multiple patches.

Thanks, Anand



Johannes Thumshirn March 3, 2023, 10:32 a.m. UTC | #4
On 03.03.23 10:30, Anand Jain wrote:
> 
> Is there a plan to rebase this series to the latest misc-next branch?
> Unfortunately, applying this patch fails at multiple patches.
> 

Will do. I messed up my latest rebase anyways so thanks for noticing it.