
[RFC,v3,00/11] btrfs: raid-stripe-tree draft patches

Message ID: cover.1666007330.git.johannes.thumshirn@wdc.com

Message

Johannes Thumshirn Oct. 17, 2022, 11:55 a.m. UTC
Here's yet another draft of my btrfs zoned RAID patches. It's based on
Christoph's bio splitting series for btrfs.

Updates of the raid-stripe-tree are done at delayed-ref time to save on
bandwidth, while for reading we do the stripe-tree lookup at bio mapping time,
i.e. when the logical to physical translation happens for regular btrfs RAID
as well.

The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
its contents are the respective physical device id and position.
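
To make the on-disk layout easier to picture, here's a rough sketch of what a
stripe extent item could look like, inferred from the dump-tree output further
down. The authoritative definitions are in the "btrfs: add raid stripe tree
definitions" patch; the struct and field names here are only illustrative:

#include <linux/types.h>

/* One physical location of a stripe extent, 16 bytes on disk. */
struct btrfs_raid_stride {
	__le64 devid;		/* device this copy/stripe lives on */
	__le64 physical;	/* physical start on that device */
} __attribute__((packed));

/* Item body: one stride per copy, e.g. two strides (32 bytes) for RAID1. */
struct btrfs_stripe_extent {
	struct btrfs_raid_stride strides[];
} __attribute__((packed));

/*
 * Item key:
 *   objectid = extent disk_bytenr    (logical address)
 *   type     = RAID_STRIPE_KEY
 *   offset   = extent disk_num_bytes (length)
 */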

For an example 1M write (split into 126K segments due to zone-append):
rapido2:/home/johannes/src/fstests# xfs_io -fdc "pwrite -b 1M 0 1M" -c fsync /mnt/test/test
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0065 sec (151.538 MiB/sec and 151.5381 ops/sec)

The tree will look as follows:

rapido2:/home/johannes/src/fstests# btrfs inspect-internal dump-tree -t raid_stripe /dev/nullb0
btrfs-progs v5.16.1 
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 805847040 items 9 free space 15770 generation 9 owner RAID_STRIPE_TREE
leaf 805847040 flags 0x1(WRITTEN) backref revision 1
checksum stored 1b22e13800000000000000000000000000000000000000000000000000000000
checksum calced 1b22e13800000000000000000000000000000000000000000000000000000000
fs uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
chunk uuid 6f2d8aaa-d348-4bf2-9b5e-141a37ba4c77
        item 0 key (939524096 RAID_STRIPE_KEY 126976) itemoff 16251 itemsize 32
                        stripe 0 devid 1 offset 939524096
                        stripe 1 devid 2 offset 536870912
        item 1 key (939651072 RAID_STRIPE_KEY 126976) itemoff 16219 itemsize 32
                        stripe 0 devid 1 offset 939651072
                        stripe 1 devid 2 offset 536997888
        item 2 key (939778048 RAID_STRIPE_KEY 126976) itemoff 16187 itemsize 32
                        stripe 0 devid 1 offset 939778048
                        stripe 1 devid 2 offset 537124864
        item 3 key (939905024 RAID_STRIPE_KEY 126976) itemoff 16155 itemsize 32
                        stripe 0 devid 1 offset 939905024
                        stripe 1 devid 2 offset 537251840
        item 4 key (940032000 RAID_STRIPE_KEY 126976) itemoff 16123 itemsize 32
                        stripe 0 devid 1 offset 940032000
                        stripe 1 devid 2 offset 537378816
        item 5 key (940158976 RAID_STRIPE_KEY 126976) itemoff 16091 itemsize 32
                        stripe 0 devid 1 offset 940158976
                        stripe 1 devid 2 offset 537505792
        item 6 key (940285952 RAID_STRIPE_KEY 126976) itemoff 16059 itemsize 32
                        stripe 0 devid 1 offset 940285952
                        stripe 1 devid 2 offset 537632768
        item 7 key (940412928 RAID_STRIPE_KEY 126976) itemoff 16027 itemsize 32
                        stripe 0 devid 1 offset 940412928
                        stripe 1 devid 2 offset 537759744
        item 8 key (940539904 RAID_STRIPE_KEY 32768) itemoff 15995 itemsize 32
                        stripe 0 devid 1 offset 940539904
                        stripe 1 devid 2 offset 537886720
total bytes 26843545600
bytes used 1245184
uuid e4f523d1-89a1-41f9-ab75-6ba3c42a28fb
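
As a quick sanity check of the dump above: the nine items cover
8 * 126976 + 32768 = 1048576 bytes, i.e. exactly the 1 MiB written, with each
zone-append segment getting its own stripe extent. And since this is RAID1,
every item records both copies, one stride on devid 1 and one on devid 2.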

Changes to v2:
- Bug fixes
- Rebased onto 20220901074216.1849941-1-hch@lst.de
- Added tracepoints
- Added leak checker
- Added RAID0 and RAID10

v2 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1656513330.git.johannes.thumshirn@wdc.com

Changes to v1:
- Write the stripe-tree at delayed-ref time (Qu)
- Add a different write path for preallocation

v1 of the patchset can be found here:
https://lore.kernel.org/linux-btrfs/cover.1652711187.git.johannes.thumshirn@wdc.com/

Johannes Thumshirn (11):
  btrfs: add raid stripe tree definitions
  btrfs: read raid-stripe-tree from disk
  btrfs: add support for inserting raid stripe extents
  btrfs: delete stripe extent on extent deletion
  btrfs: lookup physical address from stripe extent
  btrfs: add raid stripe tree pretty printer
  btrfs: zoned: allow zoned RAID1
  btrfs: allow zoned RAID0 and 10
  btrfs: fix striping with RST
  btrfs: check for leaks of ordered stripes on umount
  btrfs: add tracepoints for ordered stripes

 fs/btrfs/Makefile               |   2 +-
 fs/btrfs/block-rsv.c            |   1 +
 fs/btrfs/ctree.h                |  33 +++
 fs/btrfs/disk-io.c              |  17 ++
 fs/btrfs/extent-tree.c          |  56 +++++
 fs/btrfs/inode.c                |   6 +
 fs/btrfs/print-tree.c           |  21 ++
 fs/btrfs/raid-stripe-tree.c     | 394 ++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  60 +++++
 fs/btrfs/super.c                |   1 +
 fs/btrfs/volumes.c              |  66 +++++-
 fs/btrfs/volumes.h              |  14 +-
 fs/btrfs/zoned.c                |  43 ++++
 include/trace/events/btrfs.h    |  50 ++++
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  20 +-
 16 files changed, 768 insertions(+), 17 deletions(-)
 create mode 100644 fs/btrfs/raid-stripe-tree.c
 create mode 100644 fs/btrfs/raid-stripe-tree.h

Comments

Josef Bacik Oct. 20, 2022, 3:42 p.m. UTC | #1
On Mon, Oct 17, 2022 at 04:55:18AM -0700, Johannes Thumshirn wrote:
> Here's yet another draft of my btrfs zoned RAID patches. It's based on
> Christoph's bio splitting series for btrfs.
> 
> Updates of the raid-stripe-tree are done at delayed-ref time to save on
> bandwidth, while for reading we do the stripe-tree lookup at bio mapping time,
> i.e. when the logical to physical translation happens for regular btrfs RAID
> as well.
> 
> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
> its contents are the respective physical device id and position.
> 

So generally I'm good with this design and everything, I just have a few asks

1. I want a design doc for btrfs-dev-docs that lays out the design and how it's
   meant to be used.  The ondisk stuff, as well as the post update after the
   delayed ref runs.
2. Additionally, I would love to see it written down where exactly you want to
   use this in the future.  I know you've talked about using it for other raid
   levels, but I have a hard time paying attention to my own stuff so I'd like
   to see what the long term vision is for this, again this would probably be
   well suited for the btrfs-dev-docs update.
3. I super don't love the fact that we have mirrored extents in both places,
   especially with zoned stripping it down to 128k, this tree is going to be
   huge.  There's no way around this, but this makes the global roots thing even
   more important for scalability with zoned+RST.  I don't really think you need
   to add that bit here now, I'll make it global in my patches for extent tree
   v2.  Mostly I'm just lamenting that you're going to be ready before me and
   now you'll have to wait for the benefits of the global roots work.

Thanks,

Josef
Johannes Thumshirn Oct. 21, 2022, 8:40 a.m. UTC | #2
On 20.10.22 17:42, Josef Bacik wrote:
> On Mon, Oct 17, 2022 at 04:55:18AM -0700, Johannes Thumshirn wrote:
>> Here's yet another draft of my btrfs zoned RAID patches. It's based on
>> Christoph's bio splitting series for btrfs.
>>
>> Updates of the raid-stripe-tree are done at delayed-ref time to save on
>> bandwidth, while for reading we do the stripe-tree lookup at bio mapping time,
>> i.e. when the logical to physical translation happens for regular btrfs RAID
>> as well.
>>
>> The stripe tree is keyed by an extent's disk_bytenr and disk_num_bytes and
>> its contents are the respective physical device id and position.
>>
> 
> So generally I'm good with this design and everything, I just have a few asks
> 
> 1. I want a design doc for btrfs-dev-docs that lays out the design and how it's
>    meant to be used.  The ondisk stuff, as well as the post update after the
>    delayed ref runs.
> 2. Additionally, I would love to see it written down where exactly you want to
>    use this in the future.  I know you've talked about using it for other raid
>    levels, but I have a hard time paying attention to my own stuff so I'd like
>    to see what the long term vision is for this, again this would probably be
>    well suited for the btrfs-dev-docs update.

Sure, I'll add a document to btrfs-dev-docs (and send it to the list for review).

There's still a problem to be solved with the delayed-ref update, resulting in the
leak checker yelling on unmount, so maybe documenting what I've done will show me
where I messed up.

> 3. I super don't love the fact that we have mirrored extents in both places,
>    especially with zoned stripping it down to 128k, this tree is going to be
>    huge.  There's no way around this, but this makes the global roots thing even
>    more important for scalability with zoned+RST.  I don't really think you need
>    to add that bit here now, I'll make it global in my patches for extent tree
>    v2.  Mostly I'm just lamenting that you're going to be ready before me and
>    now you'll have to wait for the benefits of the global roots work.

Well, I'm pretty sure I won't be done before the global roots work is. But I agree,
the extra amount of metadata for the RST is a concern to me as well. Especially for
overwrite-heavy workloads it produces a lot of new extents for every CoW write.
Combine that with zoned and we really need working reclaim, otherwise it all goes
down the drain.
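
Just to put a rough number on it (back-of-the-envelope only, assuming 128K
granularity, the 32 byte RAID1 items from the dump above and btrfs' 25 byte
item headers): 1 GiB of data needs about 8192 stripe extent items, so somewhere
around 450 KiB of RST metadata per GiB written, before any CoW churn on top.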

Can you please remind me: with your global roots, am I getting a root per metadata
block group or per data block group?