Message ID | 20240918140850.27261-1-jth@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | Add dummy RAID stripe tree entries for PREALLOC data | expand |
在 2024/9/18 23:38, Johannes Thumshirn 写道: > This is an RFC implementation of Qu's request to be able to > distinguish preallocated extents in the stripe tree for scrub. > > It's not 100% working yet but only showing the basic "how it's going to > look like". > > I'm not really sure this is a better solution than returning ENOENT > and ignoring it in scrub. If RST without zoned supports preallocation and NOCOW, then I think we should not just insert a dummy, but a real RST entries. So that NOCOW/preallocated writes can just re-use the existing RST entry, without any new RST updates. At least logically it makes more sense. At least a quick glance into the handling shows, NOCOW writes doesn't trigger any RST updates, so I guess it should also apply to PREALLOCATED writes too. Or did I miss something else? Thanks, Qu > > A third possibility would be to do a full backref walk on > btrfs_map_block() error and then check if it's a preallocated extent. > > Johannes Thumshirn (2): > btrfs: introduce dummy RAID stripes for preallocated data > btrfs: insert dummy RAID stripe extents for preallocated data > > fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++ > fs/btrfs/raid-stripe-tree.c | 47 +++++++++++++++++++++++++++++++++ > fs/btrfs/raid-stripe-tree.h | 2 ++ > include/uapi/linux/btrfs_tree.h | 1 + > 4 files changed, 97 insertions(+) >
On 19.09.24 01:46, Qu Wenruo wrote: > 在 2024/9/18 23:38, Johannes Thumshirn 写道: >> This is an RFC implementation of Qu's request to be able to >> distinguish preallocated extents in the stripe tree for scrub. >> >> It's not 100% working yet but only showing the basic "how it's going to >> look like". >> >> I'm not really sure this is a better solution than returning ENOENT >> and ignoring it in scrub. > > If RST without zoned supports preallocation and NOCOW, then I think we > should not just insert a dummy, but a real RST entries. > > So that NOCOW/preallocated writes can just re-use the existing RST > entry, without any new RST updates. > > At least logically it makes more sense. > > At least a quick glance into the handling shows, NOCOW writes doesn't > trigger any RST updates, so I guess it should also apply to PREALLOCATED > writes too. Yup, got a fix for NOCOW. That again was a stupid mistake because I only had zoned in mind. For prealloc I don't this I should add RST entries, because there's nothing on disk for these ranges. It's pre-allocated after all. Once we write into the range, RST entries will be created: virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" test virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1 EXTENT_ITEM item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 itemsize 53 refs 1 gen 8 flags DATA virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3 EXTENT_DATA item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 8 type 2 (prealloc) prealloc data disk byte 298844160 nr 1048576 prealloc data offset 0 nr 1048576 virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda btrfs-progs v6.9 raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0) leaf 5259264 items 0 free space 16283 generation 3 owner RAID_STRIPE_TREE leaf 5259264 flags 0x1(WRITTEN) backref revision 1 checksum stored f54ade94 checksum calced f54ade94 fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e total bytes 21474836480 bytes used 1212416 uuid bebf1755-9379-4ac8-a623-ad0dc52641cf virtme-scsi:/mnt # xfs_io -fc "pwrite 64k 64k" test wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0003 sec (183.284 MiB/sec and 46920.8211 ops/sec) virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | grep -A 1 EXTENT_ITEM item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 itemsize 53 refs 3 gen 8 flags DATA virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3 EXTENT_DATA item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 298844160 nr 1048576 prealloc data offset 0 nr 65536 item 7 key (257 EXTENT_DATA 65536) itemoff 15763 itemsize 53 generation 9 type 1 (regular) extent data disk byte 298844160 nr 1048576 extent data offset 65536 nr 65536 ram 1048576 -- item 8 key (257 EXTENT_DATA 131072) itemoff 15710 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 298844160 nr 1048576 prealloc data offset 131072 nr 917504 virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda btrfs-progs v6.9 raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0) leaf 30638080 items 1 free space 16226 generation 9 owner RAID_STRIPE_TREE leaf 30638080 flags 0x1(WRITTEN) backref revision 1 checksum stored 9ed94b4d checksum calced 9ed94b4d fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e item 0 key (298909696 RAID_STRIPE 65536) itemoff 16251 itemsize 32 stripe 0 devid 1 physical 298909696 stripe 1 devid 2 physical 277938176 total bytes 21474836480 bytes used 1212416 uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
On 18.09.2024 16:08, Johannes Thumshirn wrote: > This is an RFC implementation of Qu's request to be able to > distinguish preallocated extents in the stripe tree for scrub. > > It's not 100% working yet but only showing the basic "how it's going to > look like". > > I'm not really sure this is a better solution than returning ENOENT > and ignoring it in scrub. > > A third possibility would be to do a full backref walk on > btrfs_map_block() error and then check if it's a preallocated extent. > > Johannes Thumshirn (2): > btrfs: introduce dummy RAID stripes for preallocated data > btrfs: insert dummy RAID stripe extents for preallocated data > > fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++ > fs/btrfs/raid-stripe-tree.c | 47 +++++++++++++++++++++++++++++++++ > fs/btrfs/raid-stripe-tree.h | 2 ++ > include/uapi/linux/btrfs_tree.h | 1 + > 4 files changed, 97 insertions(+) > Will this also solve the compression topic (see subject "BTRFS doesn't compress on the fly") for e.g. PostgreSQL (which preallocates)? Thnx. Ciao, Gerhard
On 19.09.24 19:06, Gerhard Wiesinger wrote: >> Johannes Thumshirn (2): >> btrfs: introduce dummy RAID stripes for preallocated data >> btrfs: insert dummy RAID stripe extents for preallocated data >> >> fs/btrfs/inode.c | 47 +++++++++++++++++++++++++++++++++ >> fs/btrfs/raid-stripe-tree.c | 47 +++++++++++++++++++++++++++++++++ >> fs/btrfs/raid-stripe-tree.h | 2 ++ >> include/uapi/linux/btrfs_tree.h | 1 + >> 4 files changed, 97 insertions(+) >> > Will this also solve the compression topic (see subject "BTRFS doesn't > compress on the fly") for e.g. PostgreSQL (which preallocates)? No I'm afraid this is something completely different. This is about creating RAID stripe-tree entries. which noone should be running in any kind of a production environment anyways, as the feature is still very experimental.
On Wed, Sep 18, 2024 at 04:08:48PM +0200, Johannes Thumshirn wrote: > This is an RFC implementation of Qu's request to be able to > distinguish preallocated extents in the stripe tree for scrub. > > It's not 100% working yet but only showing the basic "how it's going to > look like". > > I'm not really sure this is a better solution than returning ENOENT > and ignoring it in scrub. > > A third possibility would be to do a full backref walk on > btrfs_map_block() error and then check if it's a preallocated extent. > > Johannes Thumshirn (2): > btrfs: introduce dummy RAID stripes for preallocated data > btrfs: insert dummy RAID stripe extents for preallocated data > I don't like this approach, I'd rather have a RST represent actually written bytes on disk rather than adding entries to work around this particular case. I think that -ENOENT from btrfs_map_block() is liable to result in weird problems wif we return -ENOENT for other operations. I think that you change it to return -ENODATA so that we can for sure trace it back to RST, and use that to indicate that we need to skip that extent. This way we have something that is unique to RST, and we don't have all these other entries that are unrelated to the RST's job. Thanks, Josef