mbox series

[RFC,0/2] Add dummy RAID stripe tree entries for PREALLOC data

Message ID 20240918140850.27261-1-jth@kernel.org (mailing list archive)
Headers show
Series Add dummy RAID stripe tree entries for PREALLOC data | expand

Message

Johannes Thumshirn Sept. 18, 2024, 2:08 p.m. UTC
This is an RFC implementation of Qu's request to be able to
distinguish preallocated extents in the stripe tree for scrub.

It's not 100% working yet but only showing the basic "how it's going to
look like".

I'm not really sure this is a better solution than returning ENOENT
and ignoring it in scrub.

A third possibility would be to do a full backref walk on
btrfs_map_block() error and then check if it's a preallocated extent.

Johannes Thumshirn (2):
  btrfs: introduce dummy RAID stripes for preallocated data
  btrfs: insert dummy RAID stripe extents for preallocated data

 fs/btrfs/inode.c                | 47 +++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.c     | 47 +++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  2 ++
 include/uapi/linux/btrfs_tree.h |  1 +
 4 files changed, 97 insertions(+)

Comments

Qu Wenruo Sept. 18, 2024, 11:45 p.m. UTC | #1
在 2024/9/18 23:38, Johannes Thumshirn 写道:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
>
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
>
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.

If RST without zoned supports preallocation and NOCOW, then I think we
should not just insert a dummy, but a real RST entries.

So that NOCOW/preallocated writes can just re-use the existing RST
entry, without any new RST updates.

At least logically it makes more sense.

At least a quick glance into the handling shows, NOCOW writes doesn't
trigger any RST updates, so I guess it should also apply to PREALLOCATED
writes too.

Or did I miss something else?

Thanks,
Qu
>
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
>
> Johannes Thumshirn (2):
>    btrfs: introduce dummy RAID stripes for preallocated data
>    btrfs: insert dummy RAID stripe extents for preallocated data
>
>   fs/btrfs/inode.c                | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.c     | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.h     |  2 ++
>   include/uapi/linux/btrfs_tree.h |  1 +
>   4 files changed, 97 insertions(+)
>
Johannes Thumshirn Sept. 19, 2024, 3:42 p.m. UTC | #2
On 19.09.24 01:46, Qu Wenruo wrote:
> 在 2024/9/18 23:38, Johannes Thumshirn 写道:
>> This is an RFC implementation of Qu's request to be able to
>> distinguish preallocated extents in the stripe tree for scrub.
>>
>> It's not 100% working yet but only showing the basic "how it's going to
>> look like".
>>
>> I'm not really sure this is a better solution than returning ENOENT
>> and ignoring it in scrub.
> 
> If RST without zoned supports preallocation and NOCOW, then I think we
> should not just insert a dummy, but a real RST entries.
> 
> So that NOCOW/preallocated writes can just re-use the existing RST
> entry, without any new RST updates.
> 
> At least logically it makes more sense.
> 
> At least a quick glance into the handling shows, NOCOW writes doesn't
> trigger any RST updates, so I guess it should also apply to PREALLOCATED
> writes too.

Yup, got a fix for NOCOW. That again was a stupid mistake because I only 
had zoned in mind.

For prealloc I don't this I should add RST entries, because there's 
nothing on disk for these ranges. It's pre-allocated after all.

Once we write into the range, RST entries will be created:

virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" test
virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1 
EXTENT_ITEM
         item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 
itemsize 53
                 refs 1 gen 8 flags DATA
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda  | grep -A 3 
EXTENT_DATA
         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                 generation 8 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 0 nr 1048576
virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda
btrfs-progs v6.9
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 5259264 items 0 free space 16283 generation 3 owner RAID_STRIPE_TREE
leaf 5259264 flags 0x1(WRITTEN) backref revision 1
checksum stored f54ade94
checksum calced f54ade94
fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e
total bytes 21474836480
bytes used 1212416
uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
virtme-scsi:/mnt # xfs_io -fc "pwrite 64k 64k" test
wrote 65536/65536 bytes at offset 65536
64 KiB, 16 ops; 0.0003 sec (183.284 MiB/sec and 46920.8211 ops/sec)
virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1 
EXTENT_ITEM
         item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 
itemsize 53
                 refs 3 gen 8 flags DATA
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3 
EXTENT_DATA
         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                 generation 9 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 0 nr 65536
         item 7 key (257 EXTENT_DATA 65536) itemoff 15763 itemsize 53
                 generation 9 type 1 (regular)
                 extent data disk byte 298844160 nr 1048576
                 extent data offset 65536 nr 65536 ram 1048576
--
         item 8 key (257 EXTENT_DATA 131072) itemoff 15710 itemsize 53
                 generation 9 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 131072 nr 917504
virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda
btrfs-progs v6.9
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 30638080 items 1 free space 16226 generation 9 owner RAID_STRIPE_TREE
leaf 30638080 flags 0x1(WRITTEN) backref revision 1
checksum stored 9ed94b4d
checksum calced 9ed94b4d
fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e
         item 0 key (298909696 RAID_STRIPE 65536) itemoff 16251 itemsize 32
                         stripe 0 devid 1 physical 298909696
                         stripe 1 devid 2 physical 277938176
total bytes 21474836480
bytes used 1212416
uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
Gerhard Wiesinger Sept. 19, 2024, 4:57 p.m. UTC | #3
On 18.09.2024 16:08, Johannes Thumshirn wrote:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
>
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
>
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.
>
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
>
> Johannes Thumshirn (2):
>    btrfs: introduce dummy RAID stripes for preallocated data
>    btrfs: insert dummy RAID stripe extents for preallocated data
>
>   fs/btrfs/inode.c                | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.c     | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.h     |  2 ++
>   include/uapi/linux/btrfs_tree.h |  1 +
>   4 files changed, 97 insertions(+)
>
Will this also solve the compression topic (see subject "BTRFS doesn't 
compress on the fly") for e.g. PostgreSQL (which preallocates)?

Thnx.

Ciao,

Gerhard
Johannes Thumshirn Sept. 20, 2024, 9:50 a.m. UTC | #4
On 19.09.24 19:06, Gerhard Wiesinger wrote:
>> Johannes Thumshirn (2):
>>     btrfs: introduce dummy RAID stripes for preallocated data
>>     btrfs: insert dummy RAID stripe extents for preallocated data
>>
>>    fs/btrfs/inode.c                | 47 +++++++++++++++++++++++++++++++++
>>    fs/btrfs/raid-stripe-tree.c     | 47 +++++++++++++++++++++++++++++++++
>>    fs/btrfs/raid-stripe-tree.h     |  2 ++
>>    include/uapi/linux/btrfs_tree.h |  1 +
>>    4 files changed, 97 insertions(+)
>>
> Will this also solve the compression topic (see subject "BTRFS doesn't
> compress on the fly") for e.g. PostgreSQL (which preallocates)?

No I'm afraid this is something completely different. This is about 
creating RAID stripe-tree entries. which noone should be running in any 
kind of a production environment anyways, as the feature is still very 
experimental.
Josef Bacik Sept. 23, 2024, 3:27 p.m. UTC | #5
On Wed, Sep 18, 2024 at 04:08:48PM +0200, Johannes Thumshirn wrote:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
> 
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
> 
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.
> 
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
> 
> Johannes Thumshirn (2):
>   btrfs: introduce dummy RAID stripes for preallocated data
>   btrfs: insert dummy RAID stripe extents for preallocated data
> 

I don't like this approach, I'd rather have a RST represent actually written
bytes on disk rather than adding entries to work around this particular case.

I think that -ENOENT from btrfs_map_block() is liable to result in weird
problems wif we return -ENOENT for other operations.  I think that you change it
to return -ENODATA so that we can for sure trace it back to RST, and use that to
indicate that we need to skip that extent.  This way we have something that is
unique to RST, and we don't have all these other entries that are unrelated to
the RST's job.  Thanks,

Josef