
[v5,00/13] btrfs: introduce RAID stripe tree

Message ID cover.1675853489.git.johannes.thumshirn@wdc.com (mailing list archive)

Message

Johannes Thumshirn Feb. 8, 2023, 10:57 a.m. UTC
Updates of the raid-stripe-tree are done at delayed-ref time to save on
bandwidth, while for reading we do the stripe-tree lookup at bio mapping
time, i.e. when the logical to physical translation happens for regular
btrfs RAID as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
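
For illustration, here is a minimal sketch of what a stripe extent item
could look like on disk, inferred from the itemsize in the dump below
(32 bytes == two 16-byte strides). The struct and field names are
assumptions for this sketch, not necessarily the final on-disk format:

struct btrfs_raid_stride {
        /* device this stripe of the extent lives on */
        __le64 devid;
        /* physical location of this stripe on that device */
        __le64 physical;
} __attribute__ ((__packed__));

struct btrfs_stripe_extent {
        /* one stride per mirror/stripe of the extent */
        struct btrfs_raid_stride strides[];
} __attribute__ ((__packed__));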

For an example 1M write (split into 126K segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows:

rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
btrfs-progs v5.16.1 
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
leaf 805847040 flags 0x1(WRITTEN) backref revision 1
checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
        item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
                        stripe 0 devid 1 offset 939524096
                        stripe 1 devid 2 offset 536870912
        item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
                        stripe 0 devid 1 offset 939651072
                        stripe 1 devid 2 offset 536997888
        item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
                        stripe 0 devid 1 offset 939778048
                        stripe 1 devid 2 offset 537124864
        item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
                        stripe 0 devid 1 offset 939905024
                        stripe 1 devid 2 offset 537251840
        item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
                        stripe 0 devid 1 offset 940032000
                        stripe 1 devid 2 offset 537378816
        item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
                        stripe 0 devid 1 offset 940158976
                        stripe 1 devid 2 offset 537505792
        item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
                        stripe 0 devid 1 offset 940285952
                        stripe 1 devid 2 offset 537632768
        item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
                        stripe 0 devid 1 offset 940412928
                        stripe 1 devid 2 offset 537759744
        item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
                        stripe 0 devid 1 offset 940539904
                        stripe 1 devid 2 offset 537886720
total bytes 26843545600
bytes used 1245184
uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
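
(As a quick sanity check, not part of the dump output: the nine items add
up to the full write, 8 * 126976 + 32768 = 1048576 bytes, i.e. the 1 MiB
submitted above split at the zone-append limit.)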

A design document can be found here:
https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true


Changes to v4:
- Added patch to check for RST feature in sysfs
- Added RST lookups for scrubbing 
- Fixed the error handling bug Josef pointed out
- Only check if we need to write out a RST once per delayed_ref head
- Added support for zoned data DUP with RST

Changes to v3:
- Rebased onto 20221120124734.18634-1-hch@lst.de
- Incorporated Josef's review
- Merged related patches

v3 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Johannes Thumshirn (13):
  btrfs: re-add trans parameter to insert_delayed_ref
  btrfs: add raid stripe tree definitions
  btrfs: read raid-stripe-tree from disk
  btrfs: add support for inserting raid stripe extents
  btrfs: delete stripe extent on extent deletion
  btrfs: lookup physical address from stripe extent
  btrfs: add raid stripe tree pretty printer
  btrfs: zoned: allow zoned RAID
  btrfs: check for leaks of ordered stripes on umount
  btrfs: add tracepoints for ordered stripes
  btrfs: announce presence of raid-stripe-tree in sysfs
  btrfs: consult raid-stripe-tree when scrubbing
  btrfs: add raid-stripe-tree to features enabled with debug

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/accessors.h            |  29 +++
 fs/btrfs/bio.c                  |  29 +++
 fs/btrfs/bio.h                  |   2 +
 fs/btrfs/block-rsv.c            |   1 +
 fs/btrfs/delayed-ref.c          |  13 +-
 fs/btrfs/delayed-ref.h          |   2 +
 fs/btrfs/disk-io.c              |  30 ++-
 fs/btrfs/disk-io.h              |   5 +
 fs/btrfs/extent-tree.c          |  68 ++++++
 fs/btrfs/fs.h                   |   8 +-
 fs/btrfs/inode.c                |  15 +-
 fs/btrfs/print-tree.c           |  21 ++
 fs/btrfs/raid-stripe-tree.c     | 415 ++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  87 +++++++
 fs/btrfs/scrub.c                |  33 ++-
 fs/btrfs/super.c                |   1 +
 fs/btrfs/sysfs.c                |   3 +
 fs/btrfs/volumes.c              |  39 ++-
 fs/btrfs/volumes.h              |  12 +-
 fs/btrfs/zoned.c                |  49 +++-
 include/trace/events/btrfs.h    |  50 ++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  20 +-
 24 files changed, 905 insertions(+), 30 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

Comments

Qu Wenruo Feb. 9, 2023, 12:42 a.m. UTC | #1
On 2023/2/8 18:57, Johannes Thumshirn wrote:
> Updates of the raid-stripe-tree are done at delayed-ref time to safe on
> bandwidth while for reading we do the stripe-tree lookup on bio mapping time,
> i.e. when the logical to physical translation happens for regular btrfs RAID
> as well.
> 
> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
> it's contents are the respective physical device id and position.
> 
> For an example 1M write (split into 126K segments due to zone-append)
> rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
> 
> The tree will look as follows:
> 
> rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
> btrfs-progs v5.16.1
> raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
> leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
> leaf 805847040 flags 0x1(WRITTEN) backref revision 1
> checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
> checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
> fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
> chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
>          item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
>                          stripe 0 devid 1 offset 939524096
>                          stripe 1 devid 2 offset 536870912

Considering we already have the length as the key offset, can we merge 
continuous entries?

I'm pretty sure if we go this path, the rst tree itself can be too 
large, and it's better we consider this before it's too problematic.

Thanks,
Qu

Johannes Thumshirn Feb. 9, 2023, 8:47 a.m. UTC | #2
On 09.02.23 01:42, Qu Wenruo wrote:
> Considering we already have the length as the key offset, can we merge 
> continuous entries?
> 
> I'm pretty sure if we go this path, the rst tree itself can be too 
> large, and it's better we consider this before it's too problematic.

Yes, this is something I was considering doing as a 3rd (or 4th) step,
once the basics have landed.

It can easily be done afterwards without breaking any existing
installations.
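
As a purely illustrative sketch (names and helpers are assumptions, not
part of this series), two neighbouring stripe extent items would be
mergeable when both the logical range and each per-device physical range
continue back to back:

/*
 * Illustrative only: a possible mergeability test for two adjacent
 * RAID stripe extent items, assuming a stride of {devid, physical}.
 */
static bool rst_can_merge(const struct btrfs_key *left_key,
                          const struct btrfs_raid_stride *left,
                          const struct btrfs_key *right_key,
                          const struct btrfs_raid_stride *right,
                          int num_stripes)
{
        int i;

        /* logical ranges must be contiguous */
        if (left_key->objectid + left_key->offset != right_key->objectid)
                return false;

        /* every stripe must continue on the same device, physically contiguous */
        for (i = 0; i < num_stripes; i++) {
                if (le64_to_cpu(left[i].devid) != le64_to_cpu(right[i].devid))
                        return false;
                if (le64_to_cpu(left[i].physical) + left_key->offset !=
                    le64_to_cpu(right[i].physical))
                        return false;
        }

        return true;
}
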
Phillip Susi Feb. 9, 2023, 3:57 p.m. UTC | #3
Johannes Thumshirn <johannes.thumshirn@wdc.com> writes:

> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

Nice document, but I'm still not quite sure I understand the problem.
As long as both disks have the same zone layout, and the raid chunk is
aligned to the start of a zone, then shouldn't they be appended together
and have a deterministic layout?

If so, then is this additional metadata just needed in the case where
the disks *don't* have the same zone layout?

If so, then is this an optional feature that would only be enabled when
the disks don't have the same zone layout?
Johannes Thumshirn Feb. 10, 2023, 8:44 a.m. UTC | #4
On 09.02.23 17:01, Phillip Susi wrote:
> 
> Johannes Thumshirn <johannes.thumshirn@wdc.com> writes:
> 
>> A design document can be found here:
>> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
> 
> Nice document, but I'm still not quite sure I understand the problem.
> As long as both disks have the same zone layout, and the raid chunk is
> aligned to the start of a zone, then shouldn't they be appended together
> and have a deterministic layout?
> 
> If so, then is this additional metadata just needed in the case where
> the disks *don't* have the same zone layout?
> 
> If so, then is this an optional feature that would only be enabled when
> the disks don't have the same zone layout?
> 
> 

No. With zoned drives we're writing using the Zone Append command [1].
This has several advantages, one being that you can issue IO at a high
queue depth and don't need any locking to do so. But it has one downside
for the RAID application, that is, that you don't have any control over
the LBA where the data lands, only over the zone.

Therefore we need another logical to physical mapping layer, which is
the RAID stripe tree. Coincidentally we can also use this tree to do
l2p mapping for RAID5/6 and eliminate the write hole this way.


[1] https://zonedstorage.io/docs/introduction/zns#zone-append
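
To make the extra translation step concrete, here is a rough sketch of
what the read-side lookup amounts to (the rst_entry type, rst_lookup()
and the other names are assumptions, not the code in this series): the
physical address comes from the stored stripe entry rather than from
fixed chunk-map arithmetic.

/* Hypothetical in-memory form of one stripe extent item. */
struct rst_entry {
        u64 bytenr;        /* key.objectid: logical start */
        u64 len;           /* key.offset: length */
        struct {
                u64 devid;
                u64 physical;
        } strides[2];
};

/*
 * Illustrative sketch: map a logical address to a physical address on
 * the device backing the given mirror, using the stripe extent item
 * covering the logical range. rst_lookup() stands in for the actual
 * tree search.
 */
static int rst_map_logical(struct btrfs_fs_info *fs_info, u64 logical,
                           int mirror, u64 *devid_ret, u64 *physical_ret)
{
        struct rst_entry entry;
        int ret;

        ret = rst_lookup(fs_info, logical, &entry);
        if (ret)
                return ret;

        *devid_ret = entry.strides[mirror].devid;
        *physical_ret = entry.strides[mirror].physical +
                        (logical - entry.bytenr);

        return 0;
}
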
Johannes Thumshirn Feb. 10, 2023, 10:33 a.m. UTC | #5
On 10.02.23 09:44, Johannes Thumshirn wrote:
> On 09.02.23 17:01, Phillip Susi wrote:
>>
>> Johannes Thumshirn <johannes.thumshirn@wdc.com> writes:
>>
>>> A design document can be found here:
>>> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
>>
>> Nice document, but I'm still not quite sure I understand the problem.
>> As long as both disks have the same zone layout, and the raid chunk is
>> aligned to the start of a zone, then shouldn't they be appended together
>> and have a deterministic layout?
>>
>> If so, then is this additional metadata just needed in the case where
>> the disks *don't* have the same zone layout?
>>
>> If so, then is this an optional feature that would only be enabled when
>> the disks don't have the same zone layout?
>>
>>
> 
> No. With zoned drives we're writing using the Zone Append command [1].
> This has several advantages, one being that you can issue IO at a high
> queue depth and don't need any locking to. But it has one downside for
> the RAID application, that is, that you don't have any control of the 
> LBA where the data lands, only the zone.
> 
> Therefor we need another logical to physical mapping layer, which is
> the RAID stripe tree. Coincidentally we can also use this tree to do
> l2p mapping for RAID5/6 and eliminate the write hole this way.
> 
> 
> [1] https://zonedstorage.io/docs/introduction/zns#zone-append
> 

Actually that's the one I was looking for:
https://zonedstorage.io/docs/introduction/zoned-storage#zone-append
Phillip Susi Feb. 13, 2023, 4:42 p.m. UTC | #6
Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:

> No. With zoned drives we're writing using the Zone Append command [1].
> This has several advantages, one being that you can issue IO at a high
> queue depth and don't need any locking to. But it has one downside for
> the RAID application, that is, that you don't have any control of the 
> LBA where the data lands, only the zone.

Can they be reordered in the queue?  As long as they are issued in the
same order on both drives and can't get reordered, I would think that
the write pointer on both drives would remain in sync.
Johannes Thumshirn Feb. 13, 2023, 5:44 p.m. UTC | #7
On 13.02.23 17:47, Phillip Susi wrote:
> 
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:
> 
>> No. With zoned drives we're writing using the Zone Append command [1].
>> This has several advantages, one being that you can issue IO at a high
>> queue depth and don't need any locking to. But it has one downside for
>> the RAID application, that is, that you don't have any control of the 
>> LBA where the data lands, only the zone.
> 
> Can they be reordered in the queue?  As long as they are issued in the
> same order on both drives and can't get reordered, I would think that
> the write pointer on both drives would remain in sync.
> 

There is no guarantee for that, no. The block layer can theoretically
re-order all WRITEs. This is why btrfs also needs the mq-deadline IO
scheduler, as metadata is written as WRITE with QD=1 (protected by
btrfs_meta_io_lock() inside btrfs and by the zone write lock in the
IO scheduler).

I unfortunately can't remember the exact reasons why the block layer
can't be made to guarantee that it won't re-order the IO. I'd have to
defer that question to Christoph.
Phillip Susi Feb. 13, 2023, 5:56 p.m. UTC | #8
Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:

> There is no guarantee for that, no. The block layer can theoretically
> re-order all WRITEs. This is why btrfs also needs the mq-deadline IO

Unless you submit barriers to prevent that, right?  Why not do that?

> scheduler as metadata is written as WRITE with QD=1 (protected by the
> btrfs_meta_io_lock() inside btrfs and the zone write lock in the 
> IO scheduler.
>
> I unfortunately can't remember the exact reasons why the block layer
> cannot be made in a way that it can't re-order the IO. I'd have to defer
> that question to Christoph.

I would think that to prevent fragmentation, you would want to try to
flush a large portion of data from a particular file in order, then move
to another file.  If you have large streaming writes to two files at the
same time and the allocator decides to put them in the same zone, and
they are just submitted to the stack to do in any order, isn't this
likely to lead to a lot of fragmentation?
Christoph Hellwig Feb. 14, 2023, 5:51 a.m. UTC | #9
On Mon, Feb 13, 2023 at 05:44:38PM +0000, Johannes Thumshirn wrote:
> I unfortunately can't remember the exact reasons why the block layer
> cannot be made in a way that it can't re-order the IO. I'd have to defer
> that question to Christoph.

The block layer can avoid reordering, but it's very costly and limits
you to a single queue instead of the multiple queues that blk-mq has.

Similarly the protocol and device can reorder or more typically just
not preserve order (e.g. multiple queues, multiple connections, error
handling).
Christoph Hellwig Feb. 14, 2023, 5:52 a.m. UTC | #10
On Mon, Feb 13, 2023 at 12:56:09PM -0500, Phillip Susi wrote:
> 
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:
> 
> > There is no guarantee for that, no. The block layer can theoretically
> > re-order all WRITEs. This is why btrfs also needs the mq-deadline IO
> 
> Unless you submit barriers to prevent that right?  Why not do that?

There has been no such thing as a "barrier" since 2.6.10 or so.  And that's
a good thing, as they were extremely costly.

> I would think that to prevent fragmentation, you would want to try to
> flush a large portion of data from a particular file in order then move
> to another file.  If you have large streaming writes to two files at the
> same time and the allocator decides to put them in the same zone, and
> they are just submitted to the stack to do in any order, isn't this
> likely to lead to a lot of fragmentation?

If you submit small chunks of different files to the same block group
you're always going to get fragmentation, zones or not.  Zones will
make it even worse due to the lack of preallocations or sparse use
of the block groups, though.