mbox series

[RFC,0/2] Add dummy RAID stripe tree entries for PREALLOC data

Message ID 20240918140850.27261-1-jth@kernel.org (mailing list archive)
Headers show
Series Add dummy RAID stripe tree entries for PREALLOC data | expand

Message

Johannes Thumshirn Sept. 18, 2024, 2:08 p.m. UTC
This is an RFC implementation of Qu's request to be able to
distinguish preallocated extents in the stripe tree for scrub.

It's not 100% working yet but only showing the basic "how it's going to
look like".

I'm not really sure this is a better solution than returning ENOENT
and ignoring it in scrub.

A third possibility would be to do a full backref walk on
btrfs_map_block() error and then check if it's a preallocated extent.

Johannes Thumshirn (2):
  btrfs: introduce dummy RAID stripes for preallocated data
  btrfs: insert dummy RAID stripe extents for preallocated data

 fs/btrfs/inode.c                | 47 +++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.c     | 47 +++++++++++++++++++++++++++++++++
 fs/btrfs/raid-stripe-tree.h     |  2 ++
 include/uapi/linux/btrfs_tree.h |  1 +
 4 files changed, 97 insertions(+)

Comments

Qu Wenruo Sept. 18, 2024, 11:45 p.m. UTC | #1
在 2024/9/18 23:38, Johannes Thumshirn 写道:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
>
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
>
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.

If RST without zoned supports preallocation and NOCOW, then I think we
should not just insert a dummy, but a real RST entries.

So that NOCOW/preallocated writes can just re-use the existing RST
entry, without any new RST updates.

At least logically it makes more sense.

At least a quick glance into the handling shows, NOCOW writes doesn't
trigger any RST updates, so I guess it should also apply to PREALLOCATED
writes too.

Or did I miss something else?

Thanks,
Qu
>
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
>
> Johannes Thumshirn (2):
>    btrfs: introduce dummy RAID stripes for preallocated data
>    btrfs: insert dummy RAID stripe extents for preallocated data
>
>   fs/btrfs/inode.c                | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.c     | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.h     |  2 ++
>   include/uapi/linux/btrfs_tree.h |  1 +
>   4 files changed, 97 insertions(+)
>
Johannes Thumshirn Sept. 19, 2024, 3:42 p.m. UTC | #2
On 19.09.24 01:46, Qu Wenruo wrote:
> 在 2024/9/18 23:38, Johannes Thumshirn 写道:
>> This is an RFC implementation of Qu's request to be able to
>> distinguish preallocated extents in the stripe tree for scrub.
>>
>> It's not 100% working yet but only showing the basic "how it's going to
>> look like".
>>
>> I'm not really sure this is a better solution than returning ENOENT
>> and ignoring it in scrub.
> 
> If RST without zoned supports preallocation and NOCOW, then I think we
> should not just insert a dummy, but a real RST entries.
> 
> So that NOCOW/preallocated writes can just re-use the existing RST
> entry, without any new RST updates.
> 
> At least logically it makes more sense.
> 
> At least a quick glance into the handling shows, NOCOW writes doesn't
> trigger any RST updates, so I guess it should also apply to PREALLOCATED
> writes too.

Yup, got a fix for NOCOW. That again was a stupid mistake because I only 
had zoned in mind.

For prealloc I don't this I should add RST entries, because there's 
nothing on disk for these ranges. It's pre-allocated after all.

Once we write into the range, RST entries will be created:

virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" test
virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1 
EXTENT_ITEM
         item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 
itemsize 53
                 refs 1 gen 8 flags DATA
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda  | grep -A 3 
EXTENT_DATA
         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                 generation 8 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 0 nr 1048576
virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda
btrfs-progs v6.9
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 5259264 items 0 free space 16283 generation 3 owner RAID_STRIPE_TREE
leaf 5259264 flags 0x1(WRITTEN) backref revision 1
checksum stored f54ade94
checksum calced f54ade94
fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e
total bytes 21474836480
bytes used 1212416
uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
virtme-scsi:/mnt # xfs_io -fc "pwrite 64k 64k" test
wrote 65536/65536 bytes at offset 65536
64 KiB, 16 ops; 0.0003 sec (183.284 MiB/sec and 46920.8211 ops/sec)
virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1 
EXTENT_ITEM
         item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 
itemsize 53
                 refs 3 gen 8 flags DATA
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3 
EXTENT_DATA
         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                 generation 9 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 0 nr 65536
         item 7 key (257 EXTENT_DATA 65536) itemoff 15763 itemsize 53
                 generation 9 type 1 (regular)
                 extent data disk byte 298844160 nr 1048576
                 extent data offset 65536 nr 65536 ram 1048576
--
         item 8 key (257 EXTENT_DATA 131072) itemoff 15710 itemsize 53
                 generation 9 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 131072 nr 917504
virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda
btrfs-progs v6.9
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 30638080 items 1 free space 16226 generation 9 owner RAID_STRIPE_TREE
leaf 30638080 flags 0x1(WRITTEN) backref revision 1
checksum stored 9ed94b4d
checksum calced 9ed94b4d
fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e
         item 0 key (298909696 RAID_STRIPE 65536) itemoff 16251 itemsize 32
                         stripe 0 devid 1 physical 298909696
                         stripe 1 devid 2 physical 277938176
total bytes 21474836480
bytes used 1212416
uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
Gerhard Wiesinger Sept. 19, 2024, 4:57 p.m. UTC | #3
On 18.09.2024 16:08, Johannes Thumshirn wrote:
> This is an RFC implementation of Qu's request to be able to
> distinguish preallocated extents in the stripe tree for scrub.
>
> It's not 100% working yet but only showing the basic "how it's going to
> look like".
>
> I'm not really sure this is a better solution than returning ENOENT
> and ignoring it in scrub.
>
> A third possibility would be to do a full backref walk on
> btrfs_map_block() error and then check if it's a preallocated extent.
>
> Johannes Thumshirn (2):
>    btrfs: introduce dummy RAID stripes for preallocated data
>    btrfs: insert dummy RAID stripe extents for preallocated data
>
>   fs/btrfs/inode.c                | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.c     | 47 +++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.h     |  2 ++
>   include/uapi/linux/btrfs_tree.h |  1 +
>   4 files changed, 97 insertions(+)
>
Will this also solve the compression topic (see subject "BTRFS doesn't 
compress on the fly") for e.g. PostgreSQL (which preallocates)?

Thnx.

Ciao,

Gerhard