diff mbox series

btrfs: also add stripe entries for NOCOW writes

Message ID 20240923064549.14224-1-jth@kernel.org (mailing list archive)
State New, archived
Headers show
Series btrfs: also add stripe entries for NOCOW writes | expand

Commit Message

Johannes Thumshirn Sept. 23, 2024, 6:45 a.m. UTC
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also allow NOCOW writes for zoned filesystems.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Qu Wenruo Sept. 23, 2024, 7:28 a.m. UTC | #1
在 2024/9/23 16:15, Johannes Thumshirn 写道:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> NOCOW writes do not generate stripe_extent entries in the RAID stripe
> tree, as the RAID stripe-tree feature initially was designed with a
> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
> writes. But the RAID stripe-tree feature is independent from the zoned
> feature, so we must also allow NOCOW writes for zoned filesystems.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Sorry I'm going to repeat myself again, I still believe if we insert an
RST entry at falloc() time, it will be more consistent with the non-RST
code.

Yes, I known preallocated space will not need any read nor search RST
entry, and we just fill the page cache with zero at read time.

But the point of proper (not just dummy) RST entry for the whole
preallocated space is, we do not need to touch the RST entry anymore for
NOCOW/PREALLOCATED write at all.

This makes the RST NOCOW/PREALLOC writes behavior to align with the
non-RST code, which doesn't update any extent item, but only modify the
file extent for PREALLOC writes.

Thanks,
Qu

> ---
>   fs/btrfs/inode.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index edac499fd83d..c6e4b58c334c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3111,6 +3111,11 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>   		ret = btrfs_update_inode_fallback(trans, inode);
>   		if (ret) /* -ENOMEM or corruption */
>   			btrfs_abort_transaction(trans, ret);
> +
> +		ret = btrfs_insert_raid_extent(trans, ordered_extent);
> +		if (ret)
> +			btrfs_abort_transaction(trans, ret);
> +
>   		goto out;
>   	}
>
Johannes Thumshirn Sept. 23, 2024, 7:40 a.m. UTC | #2
On 23.09.24 09:28, Qu Wenruo wrote:
> 
> 
> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>> tree, as the RAID stripe-tree feature initially was designed with a
>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>> writes. But the RAID stripe-tree feature is independent from the zoned
>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> Sorry I'm going to repeat myself again, I still believe if we insert an
> RST entry at falloc() time, it will be more consistent with the non-RST
> code.
> 
> Yes, I known preallocated space will not need any read nor search RST
> entry, and we just fill the page cache with zero at read time.
> 
> But the point of proper (not just dummy) RST entry for the whole
> preallocated space is, we do not need to touch the RST entry anymore for
> NOCOW/PREALLOCATED write at all.
> 
> This makes the RST NOCOW/PREALLOC writes behavior to align with the
> non-RST code, which doesn't update any extent item, but only modify the
> file extent for PREALLOC writes.

Please re-read the patch. This is not a dummy RST entry but a real RST 
entry for NOCOW writes.
Qu Wenruo Sept. 23, 2024, 7:56 a.m. UTC | #3
在 2024/9/23 17:10, Johannes Thumshirn 写道:
> On 23.09.24 09:28, Qu Wenruo wrote:
>>
>>
>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>> tree, as the RAID stripe-tree feature initially was designed with a
>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> Sorry I'm going to repeat myself again, I still believe if we insert an
>> RST entry at falloc() time, it will be more consistent with the non-RST
>> code.
>>
>> Yes, I known preallocated space will not need any read nor search RST
>> entry, and we just fill the page cache with zero at read time.
>>
>> But the point of proper (not just dummy) RST entry for the whole
>> preallocated space is, we do not need to touch the RST entry anymore for
>> NOCOW/PREALLOCATED write at all.
>>
>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>> non-RST code, which doesn't update any extent item, but only modify the
>> file extent for PREALLOC writes.
>
> Please re-read the patch. This is not a dummy RST entry but a real RST
> entry for NOCOW writes.
>
I know, but my point is, if the RST entry for preallocated range is
already a regular one, you won't even need to insert/update the RST tree
at all.

Just like we do not need to update the extent tree for
NOCOW/PREALLOCATED writes.

Thanks,
Qu
Johannes Thumshirn Sept. 23, 2024, 8:15 a.m. UTC | #4
On 23.09.24 09:56, Qu Wenruo wrote:
>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>>> tree, as the RAID stripe-tree feature initially was designed with a
>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>>
>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> Sorry I'm going to repeat myself again, I still believe if we insert an
>>> RST entry at falloc() time, it will be more consistent with the non-RST
>>> code.
>>>
>>> Yes, I known preallocated space will not need any read nor search RST
>>> entry, and we just fill the page cache with zero at read time.
>>>
>>> But the point of proper (not just dummy) RST entry for the whole
>>> preallocated space is, we do not need to touch the RST entry anymore for
>>> NOCOW/PREALLOCATED write at all.
>>>
>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>>> non-RST code, which doesn't update any extent item, but only modify the
>>> file extent for PREALLOC writes.
>>
>> Please re-read the patch. This is not a dummy RST entry but a real RST
>> entry for NOCOW writes.
>>
> I know, but my point is, if the RST entry for preallocated range is
> already a regular one, you won't even need to insert/update the RST tree
> at all.
> 
> Just like we do not need to update the extent tree for
> NOCOW/PREALLOCATED writes.

But as long as there is no data on disk there is no point of having a 
logical to N-disk physical mapping. We haven't even called 
btrfs_map_block() before, so where do we know which disks will get the 
blocks at which address? Look at this example:

Fallocate [0, 1M]
virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test
virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | \
			grep -A 1 EXTENT_ITEM
         item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 
itemsize 53
                 refs 1 gen 8 flags DATA
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda  | \
		grep -A 3 EXTENT_DATA
         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                 generation 8 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576         [1]
                 prealloc data offset 0 nr 1048576

Write at [64k, 128k]
virtme-scsi:/mnt # xfs_io -fc "pwrite 64k 64k" test
wrote 65536/65536 bytes at offset 65536
64 KiB, 16 ops; 0.0003 sec (183.284 MiB/sec and 46920.8211 ops/sec)
virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3 
EXTENT_DATA
         item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
                 generation 9 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 0 nr 65536
         item 7 key (257 EXTENT_DATA 65536) itemoff 15763 itemsize 53
                 generation 9 type 1 (regular)
                 extent data disk byte 298844160 nr 1048576 	     [2]
                 extent data offset 65536 nr 65536 ram 1048576
--
         item 8 key (257 EXTENT_DATA 131072) itemoff 15710 itemsize 53
                 generation 9 type 2 (prealloc)
                 prealloc data disk byte 298844160 nr 1048576
                 prealloc data offset 131072 nr 917504


virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda
btrfs-progs v6.9
raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0)
leaf 30638080 items 1 free space 16226 generation 9 owner RAID_STRIPE_TREE
leaf 30638080 flags 0x1(WRITTEN) backref revision 1
checksum stored 9ed94b4d
checksum calced 9ed94b4d
fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf
chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e
         item 0 key (298909696 RAID_STRIPE 65536) itemoff 16251 itemsize 32
                         stripe 0 devid 1 physical 298909696          [3]
                         stripe 1 devid 2 physical 277938176



[1] we preallocate the data for [0, 1M] @ 298844160
[2] we have the actual written data for [64k, 128k] @ 298844160

What should I do to insert the RST entry there as we get:

[3] the physical copies starting at 298909696 on devid 1 and 277938176 
on devid 2
Qu Wenruo Sept. 23, 2024, 8:53 a.m. UTC | #5
在 2024/9/23 17:45, Johannes Thumshirn 写道:
> On 23.09.24 09:56, Qu Wenruo wrote:
>>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>>
>>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>>>> tree, as the RAID stripe-tree feature initially was designed with a
>>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>>>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>>>
>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> Sorry I'm going to repeat myself again, I still believe if we insert an
>>>> RST entry at falloc() time, it will be more consistent with the non-RST
>>>> code.
>>>>
>>>> Yes, I known preallocated space will not need any read nor search RST
>>>> entry, and we just fill the page cache with zero at read time.
>>>>
>>>> But the point of proper (not just dummy) RST entry for the whole
>>>> preallocated space is, we do not need to touch the RST entry anymore for
>>>> NOCOW/PREALLOCATED write at all.
>>>>
>>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>>>> non-RST code, which doesn't update any extent item, but only modify the
>>>> file extent for PREALLOC writes.
>>>
>>> Please re-read the patch. This is not a dummy RST entry but a real RST
>>> entry for NOCOW writes.
>>>
>> I know, but my point is, if the RST entry for preallocated range is
>> already a regular one, you won't even need to insert/update the RST tree
>> at all.
>>
>> Just like we do not need to update the extent tree for
>> NOCOW/PREALLOCATED writes.
>
> But as long as there is no data on disk there is no point of having a
> logical to N-disk physical mapping. We haven't even called
> btrfs_map_block() before, so where do we know which disks will get the
> blocks at which address? Look at this example:
>
> Fallocate [0, 1M]
> virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test
[...]
>
>
> [1] we preallocate the data for [0, 1M] @ 298844160
> [2] we have the actual written data for [64k, 128k] @ 298844160
>
> What should I do to insert the RST entry there as we get:

Do the needed write map and insert the RST entries, just as if we're
doing a write, but without any actual IO.


Currently the handling of RST is not consistent with non-RST, thus
that's the reason causing problems with scrub on preallocated extents.

I known preallocated range won't trigger any read thus it makes no sense
to do the proper RST setup.
But let's also take the example of non-RST scrub:

Do we spend our time checking if a data extent is preallocated or not?
No, we do not. We just read the content, and check against its csum.
For preallocated extents, it just has no csum.

You can argue that this is wasting IO, but I can also counter-argue that
we need to make sure the read on that device range is fine, even if we
know we will not really read the content (but that's only for now).


Furthermore, from the logical aspect, at least to me, non-RST case is
just a special case where logical address is 1:1 mapped already.

This also means, even for preallocated extents, they should have RST
entries.


Finally, I do not think it's a good idea to insert RST entries for NOCOW.
If a file is set NOCOW, it means we'll doing a lot of overwrite for it.
Then why waste our time updating the RST entries again and again?

Isn't such behavior going to cause more write amplification? Meanwhile
for non-RST cases, NOCOW should cause the least amount of write
amplification.



So again, YES, I know doing proper write map and inserting RST entries
for preallocated range makes no sense for READ.
But preallocation and NOCOW is utilized for contents we frequently
over-writes, and such RST entries save us more for those frequently
over-writes.

Thanks,
Qu
>
> [3] the physical copies starting at 298909696 on devid 1 and 277938176
> on devid 2
>
>
Johannes Thumshirn Sept. 23, 2024, 2:41 p.m. UTC | #6
On 23.09.24 10:54, Qu Wenruo wrote:
> 
> 
> 在 2024/9/23 17:45, Johannes Thumshirn 写道:
>> On 23.09.24 09:56, Qu Wenruo wrote:
>>>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>>>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>>>
>>>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>>>>> tree, as the RAID stripe-tree feature initially was designed with a
>>>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>>>>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>>>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>>>>
>>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>>
>>>>> Sorry I'm going to repeat myself again, I still believe if we insert an
>>>>> RST entry at falloc() time, it will be more consistent with the non-RST
>>>>> code.
>>>>>
>>>>> Yes, I known preallocated space will not need any read nor search RST
>>>>> entry, and we just fill the page cache with zero at read time.
>>>>>
>>>>> But the point of proper (not just dummy) RST entry for the whole
>>>>> preallocated space is, we do not need to touch the RST entry anymore for
>>>>> NOCOW/PREALLOCATED write at all.
>>>>>
>>>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>>>>> non-RST code, which doesn't update any extent item, but only modify the
>>>>> file extent for PREALLOC writes.
>>>>
>>>> Please re-read the patch. This is not a dummy RST entry but a real RST
>>>> entry for NOCOW writes.
>>>>
>>> I know, but my point is, if the RST entry for preallocated range is
>>> already a regular one, you won't even need to insert/update the RST tree
>>> at all.
>>>
>>> Just like we do not need to update the extent tree for
>>> NOCOW/PREALLOCATED writes.
>>
>> But as long as there is no data on disk there is no point of having a
>> logical to N-disk physical mapping. We haven't even called
>> btrfs_map_block() before, so where do we know which disks will get the
>> blocks at which address? Look at this example:
>>
>> Fallocate [0, 1M]
>> virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test
> [...]
>>
>>
>> [1] we preallocate the data for [0, 1M] @ 298844160
>> [2] we have the actual written data for [64k, 128k] @ 298844160
>>
>> What should I do to insert the RST entry there as we get:
> 
> Do the needed write map and insert the RST entries, just as if we're
> doing a write, but without any actual IO.
> 
> 
> Currently the handling of RST is not consistent with non-RST, thus
> that's the reason causing problems with scrub on preallocated extents.
> 
> I known preallocated range won't trigger any read thus it makes no sense
> to do the proper RST setup.
> But let's also take the example of non-RST scrub:
> 
> Do we spend our time checking if a data extent is preallocated or not?
> No, we do not. We just read the content, and check against its csum.
> For preallocated extents, it just has no csum.
> 
> You can argue that this is wasting IO, but I can also counter-argue that
> we need to make sure the read on that device range is fine, even if we
> know we will not really read the content (but that's only for now).
> 
> 
> Furthermore, from the logical aspect, at least to me, non-RST case is
> just a special case where logical address is 1:1 mapped already.
> 
> This also means, even for preallocated extents, they should have RST
> entries.
> 
> 
> Finally, I do not think it's a good idea to insert RST entries for NOCOW.
> If a file is set NOCOW, it means we'll doing a lot of overwrite for it.
> Then why waste our time updating the RST entries again and again?
> 
> Isn't such behavior going to cause more write amplification? Meanwhile
> for non-RST cases, NOCOW should cause the least amount of write
> amplification.

The whole idea behind the RST was to write the RST entries _after_ the 
data has been persisted to disk. Otherwise we're back at the write hole 
problem. See for example this imaginary sequence:

Preallocate a range. This will then also preallocate the RST entries 
with the mapping as you describe. Write to it and while you write you 
have a powerloss. The copy/stripe to disk 1 is correctly written but 
disk 2 didn't report back before the power loss happened. After we have 
power again, a read to disk 2 comes in, as we have a RST entry, the read 
will be directed to the broken entry and garbage is returned. And this 
is the good case, as we can repair it.
If it was an overwrite of a block and the same happens, we have a RST 
entry pointing to a good and a bad copy.

Once we're adding the RST entries after both writes succeed the problem 
isn't there. So for preallocated extents it is even harmful to add a RST 
entry.
Josef Bacik Sept. 23, 2024, 3:20 p.m. UTC | #7
On Mon, Sep 23, 2024 at 04:58:34PM +0930, Qu Wenruo wrote:
> 
> 
> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
> > From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > 
> > NOCOW writes do not generate stripe_extent entries in the RAID stripe
> > tree, as the RAID stripe-tree feature initially was designed with a
> > zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
> > writes. But the RAID stripe-tree feature is independent from the zoned
> > feature, so we must also allow NOCOW writes for zoned filesystems.
> > 
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> Sorry I'm going to repeat myself again, I still believe if we insert an
> RST entry at falloc() time, it will be more consistent with the non-RST
> code.
> 
> Yes, I known preallocated space will not need any read nor search RST
> entry, and we just fill the page cache with zero at read time.
> 
> But the point of proper (not just dummy) RST entry for the whole
> preallocated space is, we do not need to touch the RST entry anymore for
> NOCOW/PREALLOCATED write at all.
> 
> This makes the RST NOCOW/PREALLOC writes behavior to align with the
> non-RST code, which doesn't update any extent item, but only modify the
> file extent for PREALLOC writes.

I see what you're getting at here, but it creates a huge amount of problems
later down the line.

I prealloc an extent, I map that logical extent to a physical extent and then I
insert a RST entry for that mapping.  Now I rip out one of my disks, and now I
have to update RST entries for extents I'm not going to rewrite because they're
prealloc.

RST is a logical->physical mapping.  We do not need to update or insert anything
until we create that logical->physical mapping.  Keeping the rules consistent
across the different layers will make it easier to reason about and easier to
maintain.  Adding an index at endio time for NOCOW makes sense, we now have
created a thing on disk that we need to have a mapping for.  The same goes for
prealloc, adding an entry at prealloc time doesn't make logical sense as we
haven't yet instantiated that space on disk.  Thanks,

Josef
Qu Wenruo Sept. 23, 2024, 10:23 p.m. UTC | #8
在 2024/9/24 00:11, Johannes Thumshirn 写道:
> On 23.09.24 10:54, Qu Wenruo wrote:
>>
>>
[...]
>> Finally, I do not think it's a good idea to insert RST entries for NOCOW.
>> If a file is set NOCOW, it means we'll doing a lot of overwrite for it.
>> Then why waste our time updating the RST entries again and again?
>>
>> Isn't such behavior going to cause more write amplification? Meanwhile
>> for non-RST cases, NOCOW should cause the least amount of write
>> amplification.
>
> The whole idea behind the RST was to write the RST entries _after_ the
> data has been persisted to disk. Otherwise we're back at the write hole
> problem. See for example this imaginary sequence:
>
> Preallocate a range. This will then also preallocate the RST entries
> with the mapping as you describe. Write to it and while you write you
> have a powerloss. The copy/stripe to disk 1 is correctly written but
> disk 2 didn't report back before the power loss happened.
> After we have
> power again, a read to disk 2 comes in, as we have a RST entry, the read
> will be directed to the broken entry and garbage is returned. And this
> is the good case, as we can repair it.
> If it was an overwrite of a block and the same happens, we have a RST
> entry pointing to a good and a bad copy.

Nope, that will not happen.

Because our metadata is still COW protected, after such powerloss, the
file extent is still showing that range is PREALLOCATED, we won't even
trigger a read.

And this is exactly the same as the non-RST PREALLOCATED write.

>
> Once we're adding the RST entries after both writes succeed the problem
> isn't there. So for preallocated extents it is even harmful to add a RST
> entry.

You just forgot the metadata part, which prevents the problem from
happening in the very beginning.

Thanks,
Qu
Qu Wenruo Sept. 23, 2024, 10:32 p.m. UTC | #9
在 2024/9/24 00:50, Josef Bacik 写道:
> On Mon, Sep 23, 2024 at 04:58:34PM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>> tree, as the RAID stripe-tree feature initially was designed with a
>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> Sorry I'm going to repeat myself again, I still believe if we insert an
>> RST entry at falloc() time, it will be more consistent with the non-RST
>> code.
>>
>> Yes, I known preallocated space will not need any read nor search RST
>> entry, and we just fill the page cache with zero at read time.
>>
>> But the point of proper (not just dummy) RST entry for the whole
>> preallocated space is, we do not need to touch the RST entry anymore for
>> NOCOW/PREALLOCATED write at all.
>>
>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>> non-RST code, which doesn't update any extent item, but only modify the
>> file extent for PREALLOC writes.
>
> I see what you're getting at here, but it creates a huge amount of problems
> later down the line.
>
> I prealloc an extent, I map that logical extent to a physical extent and then I
> insert a RST entry for that mapping.  Now I rip out one of my disks, and now I
> have to update RST entries for extents I'm not going to rewrite because they're
> prealloc.

Why do we even need to do anything update the RST entries?

RST is just an extra layer for logical bytenr mapping, and did you see
the non-RST btrfs do relocation just because one device went missing?

Can you explain more on the "have to update RST entries" part?
That mismatches from my understanding of RST.

>
> RST is a logical->physical mapping.  We do not need to update or insert anything
> until we create that logical->physical mapping.

Just consider the fallocate of non-RST as an example.

We DO allocate real data extents, they have real location on the disk.

Then add the RST layer. Now preallocated extent suddenly do not have RST
mapping, but still have extents allocated for them.

I do not think this is any more consistent.

>  Keeping the rules consistent
> across the different layers will make it easier to reason about and easier to
> maintain.

I think all data extents should have RST mapping, that's way more
consistent than two different handling for different data extents.

Just like we do not bother if a data extent is preallocated or not in scrub.


>  Adding an index at endio time for NOCOW makes sense, we now have
> created a thing on disk that we need to have a mapping for.  The same goes for
> prealloc, adding an entry at prealloc time doesn't make logical sense as we
> haven't yet instantiated that space on disk.  Thanks,

But we allocated data extents. Even if we won't really utilize them for
now, we should have RST for it.

In fact, I do not even think it's correct to insert/update RST at
endio/ordered extent time.

It will be way more consistent to update/insert RST entries at data
extent allocation time.

Thanks,
Qu

>
> Josef
>
Qu Wenruo Sept. 24, 2024, 12:46 a.m. UTC | #10
在 2024/9/24 08:02, Qu Wenruo 写道:
>
>
> 在 2024/9/24 00:50, Josef Bacik 写道:
>> On Mon, Sep 23, 2024 at 04:58:34PM +0930, Qu Wenruo wrote:
>>>
>>>
>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道:
>>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe
>>>> tree, as the RAID stripe-tree feature initially was designed with a
>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow
>>>> NOCOW
>>>> writes. But the RAID stripe-tree feature is independent from the zoned
>>>> feature, so we must also allow NOCOW writes for zoned filesystems.
>>>>
>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> Sorry I'm going to repeat myself again, I still believe if we insert an
>>> RST entry at falloc() time, it will be more consistent with the non-RST
>>> code.
>>>
>>> Yes, I known preallocated space will not need any read nor search RST
>>> entry, and we just fill the page cache with zero at read time.
>>>
>>> But the point of proper (not just dummy) RST entry for the whole
>>> preallocated space is, we do not need to touch the RST entry anymore for
>>> NOCOW/PREALLOCATED write at all.
>>>
>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the
>>> non-RST code, which doesn't update any extent item, but only modify the
>>> file extent for PREALLOC writes.
>>
>> I see what you're getting at here, but it creates a huge amount of
>> problems
>> later down the line.
>>
>> I prealloc an extent, I map that logical extent to a physical extent
>> and then I
>> insert a RST entry for that mapping.  Now I rip out one of my disks,
>> and now I
>> have to update RST entries for extents I'm not going to rewrite
>> because they're
>> prealloc.
>
> Why do we even need to do anything update the RST entries?
>
> RST is just an extra layer for logical bytenr mapping, and did you see
> the non-RST btrfs do relocation just because one device went missing?
>
> Can you explain more on the "have to update RST entries" part?
> That mismatches from my understanding of RST.
>
>>
>> RST is a logical->physical mapping.  We do not need to update or
>> insert anything
>> until we create that logical->physical mapping.
>
> Just consider the fallocate of non-RST as an example.
>
> We DO allocate real data extents, they have real location on the disk.
>
> Then add the RST layer. Now preallocated extent suddenly do not have RST
> mapping, but still have extents allocated for them.
>
> I do not think this is any more consistent.
>
>>  Keeping the rules consistent
>> across the different layers will make it easier to reason about and
>> easier to
>> maintain.
>
> I think all data extents should have RST mapping, that's way more
> consistent than two different handling for different data extents.
>
> Just like we do not bother if a data extent is preallocated or not in
> scrub.
>
>
>>  Adding an index at endio time for NOCOW makes sense, we now have
>> created a thing on disk that we need to have a mapping for.  The same
>> goes for
>> prealloc, adding an entry at prealloc time doesn't make logical sense
>> as we
>> haven't yet instantiated that space on disk.  Thanks,
>
> But we allocated data extents. Even if we won't really utilize them for
> now, we should have RST for it.
>
> In fact, I do not even think it's correct to insert/update RST at
> endio/ordered extent time.
>
> It will be way more consistent to update/insert RST entries at data
> extent allocation time.

My bad, this is impossible for zoned devices, and that's why RST is here
to help the situation.

So this indeed means we have to do the work at endio/finish_ordered_io()
time.

And in that case, it looks like we really have to handle preallocated
extents differently (thanks to these zoned devices)

Thanks,
Qu
>
> Thanks,
> Qu
>
>>
>> Josef
>>
>
>
Naohiro Aota Sept. 24, 2024, 7:07 a.m. UTC | #11
On Mon, Sep 23, 2024 at 08:45:47AM GMT, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> NOCOW writes do not generate stripe_extent entries in the RAID stripe
> tree, as the RAID stripe-tree feature initially was designed with a
> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
> writes. But the RAID stripe-tree feature is independent from the zoned
> feature, so we must also allow NOCOW writes for zoned filesystems.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

The NCOW case itself looks fine for me.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

But, looking around the context, since there are some duplication among
regular and NOCOW case, it would be good time to merge them.

Apparently, btrfs_insert_raid_extent() in the regular case (current code)
should abort the transaction as same as this patch. With that fixed, code
will become really similar.

> ---
>  fs/btrfs/inode.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index edac499fd83d..c6e4b58c334c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3111,6 +3111,11 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>  		ret = btrfs_update_inode_fallback(trans, inode);
>  		if (ret) /* -ENOMEM or corruption */
>  			btrfs_abort_transaction(trans, ret);
> +
> +		ret = btrfs_insert_raid_extent(trans, ordered_extent);
> +		if (ret)
> +			btrfs_abort_transaction(trans, ret);
> +
>  		goto out;
>  	}
>  
> -- 
> 2.43.0
>
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index edac499fd83d..c6e4b58c334c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3111,6 +3111,11 @@  int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 		ret = btrfs_update_inode_fallback(trans, inode);
 		if (ret) /* -ENOMEM or corruption */
 			btrfs_abort_transaction(trans, ret);
+
+		ret = btrfs_insert_raid_extent(trans, ordered_extent);
+		if (ret)
+			btrfs_abort_transaction(trans, ret);
+
 		goto out;
 	}