Message ID: cover.1675853489.git.johannes.thumshirn@wdc.com
Series: btrfs: introduce RAID stripe tree
On 2023/2/8 18:57, Johannes Thumshirn wrote:
> Updates of the raid-stripe-tree are done at delayed-ref time to save on
> bandwidth while for reading we do the stripe-tree lookup on bio mapping time,
> i.e. when the logical to physical translation happens for regular btrfs RAID
> as well.
>
> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
> its contents are the respective physical device id and position.
>
> For an example 1M write (split into 126K segments due to zone-append)
> rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
> wrote 1048576/1048576 bytes at offset 0
> 1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)
>
> The tree will look as follows:
>
> rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
> btrfs-progs v5.16.1
> raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
> leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
> leaf 805847040 flags 0x1(WRITTEN) backref revision 1
> checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
> checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
> fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
> chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
> 	item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
> 		stripe 0 devid 1 offset 939524096
> 		stripe 1 devid 2 offset 536870912

Considering we already have the length as the key offset, can we merge
continuous entries?

I'm pretty sure if we go this path, the rst tree itself can be too
large, and it's better we consider this before it's too problematic.

Thanks,
Qu

> 	item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
> 		stripe 0 devid 1 offset 939651072
> 		stripe 1 devid 2 offset 536997888
> 	item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
> 		stripe 0 devid 1 offset 939778048
> 		stripe 1 devid 2 offset 537124864
> 	item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
> 		stripe 0 devid 1 offset 939905024
> 		stripe 1 devid 2 offset 537251840
> 	item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
> 		stripe 0 devid 1 offset 940032000
> 		stripe 1 devid 2 offset 537378816
> 	item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
> 		stripe 0 devid 1 offset 940158976
> 		stripe 1 devid 2 offset 537505792
> 	item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
> 		stripe 0 devid 1 offset 940285952
> 		stripe 1 devid 2 offset 537632768
> 	item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
> 		stripe 0 devid 1 offset 940412928
> 		stripe 1 devid 2 offset 537759744
> 	item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
> 		stripe 0 devid 1 offset 940539904
> 		stripe 1 devid 2 offset 537886720
> total bytes 26843545600
> bytes used 1245184
> uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
>
> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true
>
>
> Changes to v4:
> - Added patch to check for RST feature in sysfs
> - Added RST lookups for scrubbing
> - Fixed the error handling bug Josef pointed out
> - Only check if we need to write out a RST once per delayed_ref head
> - Added support for zoned data DUP with RST
>
> Changes to v3:
> - Rebased onto 20221120124734.18634-1-hch@lst.de
> - Incorporated Josef's review
> - Merged related patches
>
> v3 of the patchset can be found here:
> https://lore.kernel.org/linux-btrfs/cover.1666007330.git.johannes.thumshirn@wdc.com
>
> Changes to v2:
> - Bug fixes
> - Rebased onto 20220901074216.1849941-1-hch@lst.de
> - Added tracepoints
> - Added leak checker
> - Added RAID0 and RAID10
>
> v2 of the patchset can be found here:
> https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com
>
> Changes to v1:
> - Write the stripe-tree at delayed-ref time (Qu)
> - Add a different write path for preallocation
>
> v1 of the patchset can be found here:
> https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/
>
> Johannes Thumshirn (13):
>   btrfs: re-add trans parameter to insert_delayed_ref
>   btrfs: add raid stripe tree definitions
>   btrfs: read raid-stripe-tree from disk
>   btrfs: add support for inserting raid stripe extents
>   btrfs: delete stripe extent on extent deletion
>   btrfs: lookup physical address from stripe extent
>   btrfs: add raid stripe tree pretty printer
>   btrfs: zoned: allow zoned RAID
>   btrfs: check for leaks of ordered stripes on umount
>   btrfs: add tracepoints for ordered stripes
>   btrfs: announce presence of raid-stripe-tree in sysfs
>   btrfs: consult raid-stripe-tree when scrubbing
>   btrfs: add raid-stripe-tree to features enabled with debug
>
>  fs/btrfs/Makefile               |   2 +-
>  fs/btrfs/accessors.h            |  29 +++
>  fs/btrfs/bio.c                  |  29 +++
>  fs/btrfs/bio.h                  |   2 +
>  fs/btrfs/block-rsv.c            |   1 +
>  fs/btrfs/delayed-ref.c          |  13 +-
>  fs/btrfs/delayed-ref.h          |   2 +
>  fs/btrfs/disk-io.c              |  30 ++-
>  fs/btrfs/disk-io.h              |   5 +
>  fs/btrfs/extent-tree.c          |  68 ++++++
>  fs/btrfs/fs.h                   |   8 +-
>  fs/btrfs/inode.c                |  15 +-
>  fs/btrfs/print-tree.c           |  21 ++
>  fs/btrfs/raid-stripe-tree.c     | 415 ++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h     |  87 +++++++
>  fs/btrfs/scrub.c                |  33 ++-
>  fs/btrfs/super.c                |   1 +
>  fs/btrfs/sysfs.c                |   3 +
>  fs/btrfs/volumes.c              |  39 ++-
>  fs/btrfs/volumes.h              |  12 +-
>  fs/btrfs/zoned.c                |  49 +++-
>  include/trace/events/btrfs.h    |  50 ++++
>  include/uapi/linux/btrfs.h      |   1 +
>  include/uapi/linux/btrfs_tree.h |  20 +-
>  24 files changed, 905 insertions(+), 30 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
>
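The dump output above already pins down the item layout: every RAID_STRIPE_KEY item is 32 bytes and carries two stripes, i.e. 16 bytes per copy. A minimal sketch of what that implies on disk; struct and field names here are illustrative, inferred from the dump rather than copied from the patches:

```c
#include <linux/types.h>

/*
 * Illustrative only: layout inferred from the dump-tree output above
 * (itemsize 32 == 2 stripes x 16 bytes), not the authoritative
 * definition from the patch series.
 *
 * Key:  (extent disk_bytenr, RAID_STRIPE_KEY, extent disk_num_bytes)
 * Item: one stride per copy of the extent.
 */
struct btrfs_raid_stride {
	__le64 devid;	 /* device this copy was written to */
	__le64 physical; /* physical byte offset on that device */
} __attribute__((packed));

struct btrfs_stripe_extent {
	/* the RAID1 example above has two strides; other profiles differ */
	struct btrfs_raid_stride strides[];
} __attribute__((packed));
```

On the read side, looking up the extent's disk_bytenr in this tree and picking the stride for the chosen mirror is then all the logical-to-physical translation that is needed at bio mapping time.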
On 09.02.23 01:42, Qu Wenruo wrote:
> Considering we already have the length as the key offset, can we merge
> continuous entries?
>
> I'm pretty sure if we go this path, the rst tree itself can be too
> large, and it's better we consider this before it's too problematic.

Yes, this is something I was considering to do as a 3rd (or 4th) step,
once the basics have landed. It can easily be done afterwards without
breaking any eventual existing installations.
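For the record, the merge being discussed is only legal when both the logical range and every per-device physical range line up. A rough sketch of that check, reusing the illustrative struct btrfs_raid_stride from above; rst_entries_mergeable() is a hypothetical helper, not part of the series:

```c
/*
 * Hypothetical merge check: entry B can be folded into entry A
 * (extending A's key offset by B's length) only if the logical ranges
 * are adjacent and every stride continues on the same device at the
 * next physical byte.  Relies on the illustrative layout sketched above.
 */
static bool rst_entries_mergeable(u64 logical_a, u64 len_a,
				  const struct btrfs_raid_stride *a,
				  u64 logical_b,
				  const struct btrfs_raid_stride *b,
				  int num_strides)
{
	if (logical_a + len_a != logical_b)
		return false;

	for (int i = 0; i < num_strides; i++) {
		if (le64_to_cpu(a[i].devid) != le64_to_cpu(b[i].devid))
			return false;
		if (le64_to_cpu(a[i].physical) + len_a !=
		    le64_to_cpu(b[i].physical))
			return false;
	}

	return true;
}
```

In the 1M example above all nine items satisfy this condition, so they could collapse into a single (939524096 RAID_STRIPE_KEY 1048576) entry, since both devices were appended to contiguously.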
Johannes Thumshirn <johannes.thumshirn@wdc.com> writes:

> A design document can be found here:
> https://docs.google.com/document/d/1Iui_jMidCd4MVBNSSLXRfO7p5KmvnoQL/edit?usp=sharing&ouid=103609947580185458266&rtpof=true&sd=true

Nice document, but I'm still not quite sure I understand the problem.
As long as both disks have the same zone layout, and the raid chunk is
aligned to the start of a zone, then shouldn't they be appended
together and have a deterministic layout?

If so, then is this additional metadata just needed in the case where
the disks *don't* have the same zone layout?

If so, then is this an optional feature that would only be enabled
when the disks don't have the same zone layout?
On 09.02.23 17:01, Phillip Susi wrote:
> Nice document, but I'm still not quite sure I understand the problem.
> As long as both disks have the same zone layout, and the raid chunk is
> aligned to the start of a zone, then shouldn't they be appended
> together and have a deterministic layout?
>
> If so, then is this additional metadata just needed in the case where
> the disks *don't* have the same zone layout?
>
> If so, then is this an optional feature that would only be enabled
> when the disks don't have the same zone layout?

No. With zoned drives we're writing using the Zone Append command [1].
This has several advantages, one being that you can issue IO at a high
queue depth and don't need any locking to do so. But it has one
downside for the RAID application: you don't have any control over
the LBA where the data lands, only the zone.

Therefore we need another logical to physical mapping layer, which is
the RAID stripe tree. Coincidentally we can also use this tree to do
l2p mapping for RAID5/6 and eliminate the write hole this way.

[1] https://zonedstorage.io/docs/introduction/zns#zone-append
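To make the "no control over the LBA" point concrete, here is a rough kernel-style sketch of a zone-append submission, under the assumption of a single-page payload with error handling and queue-limit checks omitted; record_physical_location() is a hypothetical stand-in for whatever bookkeeping the filesystem does with the result, e.g. feeding a RAID stripe tree entry:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical bookkeeping hook, e.g. recording a per-device stripe entry. */
static void record_physical_location(struct bio *bio, sector_t sector)
{
	/* intentionally empty in this sketch */
}

/*
 * With REQ_OP_ZONE_APPEND the submitter only targets a zone; the block
 * layer reports the sector the device actually chose by updating
 * bi_iter.bi_sector at completion time.  Two mirrored appends to
 * corresponding zones on two drives can therefore complete at different
 * offsets, which is exactly what the stripe tree records per device.
 */
static void zone_append_end_io(struct bio *bio)
{
	sector_t written = bio->bi_iter.bi_sector; /* actual write location */

	record_physical_location(bio, written);
	bio_put(bio);
}

static void submit_zone_append(struct block_device *bdev, struct page *page,
			       unsigned int len, sector_t zone_start)
{
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_ZONE_APPEND, GFP_NOFS);

	bio->bi_iter.bi_sector = zone_start;	/* a zone, not a final LBA */
	__bio_add_page(bio, page, len, 0);
	bio->bi_end_io = zone_append_end_io;
	submit_bio(bio);
}
```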
On 10.02.23 09:44, Johannes Thumshirn wrote:
> No. With zoned drives we're writing using the Zone Append command [1].
> This has several advantages, one being that you can issue IO at a high
> queue depth and don't need any locking to do so. But it has one
> downside for the RAID application: you don't have any control over
> the LBA where the data lands, only the zone.
>
> Therefore we need another logical to physical mapping layer, which is
> the RAID stripe tree. Coincidentally we can also use this tree to do
> l2p mapping for RAID5/6 and eliminate the write hole this way.
>
> [1] https://zonedstorage.io/docs/introduction/zns#zone-append

Actually that's the one I was looking for:
https://zonedstorage.io/docs/introduction/zoned-storage#zone-append
Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:

> No. With zoned drives we're writing using the Zone Append command [1].
> This has several advantages, one being that you can issue IO at a high
> queue depth and don't need any locking to do so. But it has one
> downside for the RAID application: you don't have any control over
> the LBA where the data lands, only the zone.

Can they be reordered in the queue? As long as they are issued in the
same order on both drives and can't get reordered, I would think that
the write pointer on both drives would remain in sync.
On 13.02.23 17:47, Phillip Susi wrote:
> Can they be reordered in the queue? As long as they are issued in the
> same order on both drives and can't get reordered, I would think that
> the write pointer on both drives would remain in sync.

There is no guarantee for that, no. The block layer can theoretically
re-order all WRITEs. This is why btrfs also needs the mq-deadline IO
scheduler: metadata is written as WRITE with QD=1, protected by the
btrfs_meta_io_lock() inside btrfs and the zone write lock in the IO
scheduler.

I unfortunately can't remember the exact reasons why the block layer
can't be made to guarantee IO ordering. I'd have to defer that
question to Christoph.
Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:

> There is no guarantee for that, no. The block layer can theoretically
> re-order all WRITEs. This is why btrfs also needs the mq-deadline IO

Unless you submit barriers to prevent that, right? Why not do that?

> scheduler: metadata is written as WRITE with QD=1, protected by the
> btrfs_meta_io_lock() inside btrfs and the zone write lock in the IO
> scheduler.
>
> I unfortunately can't remember the exact reasons why the block layer
> can't be made to guarantee IO ordering. I'd have to defer that
> question to Christoph.

I would think that to prevent fragmentation, you would want to try to
flush a large portion of data from a particular file in order, then
move to another file. If you have large streaming writes to two files
at the same time and the allocator decides to put them in the same
zone, and they are just submitted to the stack to do in any order,
isn't this likely to lead to a lot of fragmentation?
On Mon, Feb 13, 2023 at 05:44:38PM +0000, Johannes Thumshirn wrote:
> I unfortunately can't remember the exact reasons why the block layer
> can't be made to guarantee IO ordering. I'd have to defer that
> question to Christoph.

The block layer can avoid reordering, but it's very costly and limits
you to a single queue instead of the multiple queues that blk-mq has.
Similarly the protocol and device can reorder, or more typically just
not preserve order (e.g. multiple queues, multiple connections, error
handling).
On Mon, Feb 13, 2023 at 12:56:09PM -0500, Phillip Susi wrote:
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:
>
>> There is no guarantee for that, no. The block layer can theoretically
>> re-order all WRITEs. This is why btrfs also needs the mq-deadline IO
>
> Unless you submit barriers to prevent that, right? Why not do that?

There is no such thing as a "barrier" since 2.6.10 or so. And that's a
good thing, as they are extremely costly.

> I would think that to prevent fragmentation, you would want to try to
> flush a large portion of data from a particular file in order, then
> move to another file. If you have large streaming writes to two files
> at the same time and the allocator decides to put them in the same
> zone, and they are just submitted to the stack to do in any order,
> isn't this likely to lead to a lot of fragmentation?

If you submit small chunks of different files to the same block group
you're always going to get fragmentation, zones or not. Zones will
make it even worse due to the lack of preallocations or sparse use of
the block groups, though.