
btrfs: zoned: limit ordered extent to zoned append size

Message ID: 65f1b716324a06c5cad99f2737a8669899d4569f.1621588229.git.johannes.thumshirn@wdc.com
State: New, archived
Series: btrfs: zoned: limit ordered extent to zoned append size

Commit Message

Johannes Thumshirn May 21, 2021, 9:11 a.m. UTC
Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck run afterwards reported a corrupted filesystem.

The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent, though, because
we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.

When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.

$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
   item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
   item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1

On a zoned filesystem, limit the size of an ordered extent to the maximum
size that can be issued as a single REQ_OP_ZONE_APPEND operation.

Note: This patch breaks fstests btrfs/079, as it increases the number of
on-disk extents from 80 to 83 per 10M write.

Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent_io.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Damien Le Moal May 21, 2021, 9:46 a.m. UTC | #1
On 2021/05/21 18:11, Johannes Thumshirn wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
> 
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent, though, because
> we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.
> 
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
> 
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
> 
> On a zoned filesystem, limit the size of an ordered extent to the maximum
> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
> 
> Note: This patch breaks fstests btrfs/079, as it increases the number of
> on-disk extents from 80 to 83 per 10M write.

Can this test case be fixed by calculating the number of extents that will be
written using the sysfs zone_append_max_bytes attribute? That would avoid
hard-coding a value for the zoned case...

> 
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 78d3f2ec90e0..e823b2c74af5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  				    u64 *end)
>  {
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>  	u64 delalloc_start;
>  	u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  	int ret;
>  	int loops = 0;
>  
> +	if (fs_info && fs_info->max_zone_append_size)
> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +				       PAGE_SIZE);
>  again:
>  	/* step one, find a bunch of delalloc bytes starting at start */
>  	delalloc_start = *start;
>
Johannes Thumshirn May 21, 2021, 9:52 a.m. UTC | #2
On 21/05/2021 11:46, Damien Le Moal wrote:
>> Note: This patch breaks fstests btrfs/079, as it increases the number of
>> on-disk extents from 80 to 83 per 10M write.
> Can this test case be fixed by calculating the number of extents that will be
> written using the sysfs zone_append_max_bytes attribute? That would avoid
> hard-coding a value for the zoned case...

Probably yes, but I'd like to hear from others how important they consider the
hard-coded value.
David Sterba May 21, 2021, 4:37 p.m. UTC | #3
On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
> 
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent, though, because
> we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.
> 
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
> 
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
> 
> On a zoned filesystem, limit the size of an ordered extent to the maximum
> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
> 
> Note: This patch breaks fstests btrfs/079, as it increases the number of
> on-disk extents from 80 to 83 per 10M write.
> 
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 78d3f2ec90e0..e823b2c74af5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  				    u64 *end)
>  {
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>  	u64 delalloc_start;
>  	u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  	int ret;
>  	int loops = 0;
>  
> +	if (fs_info && fs_info->max_zone_append_size)
> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +				       PAGE_SIZE);

Why is the alignment needed? Are the max zone append values expected to
be so random? Also, it's using a memory-related value for something that's
more hw-related, or at least extent size (which ends up on disk).

>  again:
>  	/* step one, find a bunch of delalloc bytes starting at start */
>  	delalloc_start = *start;
> -- 
> 2.31.1
Damien Le Moal May 23, 2021, 11:05 p.m. UTC | #4
On 2021/05/22 1:39, David Sterba wrote:
> On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
>> Damien reported a test failure with btrfs/209. The test itself ran fine,
>> but the fsck run afterwards reported a corrupted filesystem.
>>
>> The filesystem corruption happens because we're splitting an extent and
>> then writing the extent twice. We have to split the extent, though, because
>> we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.
>>
>> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
>> start address but different lengths.
>>
>> $ btrfs inspect dump-tree /dev/nullb1 -t extent
>> ...
>>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>>            refs 1 gen 7 flags DATA
>>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>>            refs 1 gen 7 flags DATA
>>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>
>> On a zoned filesystem, limit the size of an ordered extent to the maximum
>> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
>>
>> Note: This patch breaks fstests btrfs/079, as it increases the number of
>> on-disk extents from 80 to 83 per 10M write.
>>
>> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>  fs/btrfs/extent_io.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 78d3f2ec90e0..e823b2c74af5 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  				    u64 *end)
>>  {
>>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>>  	u64 delalloc_start;
>>  	u64 delalloc_end;
>> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  	int ret;
>>  	int loops = 0;
>>  
>> +	if (fs_info && fs_info->max_zone_append_size)
>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>> +				       PAGE_SIZE);
> 
> Why is the alignment needed? Are the max zone append values expected to
> be so random? Also, it's using a memory-related value for something that's
> more hw-related, or at least extent size (which ends up on disk).

It is similar to max_hw_sectors: the hardware decides what the value is. So we
cannot assume anything about what max_zone_append_size is.

I think that Johannes' patch here limits the extent size to the HW value to avoid
having to split the extent later on. That is efficient, but indeed a bit of a
layering violation here.

> 
>>  again:
>>  	/* step one, find a bunch of delalloc bytes starting at start */
>>  	delalloc_start = *start;
>> -- 
>> 2.31.1
>
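
For context, the per-device limit discussed here is taken straight from the block
layer's queue limit when btrfs loads the zone information. A simplified sketch
(based on the 5.12-era fs/btrfs/zoned.c, not a verbatim excerpt; the
filesystem-wide fs_info->max_zone_append_size is the minimum of this value
across all devices):

	struct request_queue *queue = bdev_get_queue(device->bdev);

	/* Whatever limit the hardware/driver advertises, in bytes. */
	zone_info->max_zone_append_size =
		(u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;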
Johannes Thumshirn May 25, 2021, 9:40 a.m. UTC | #5
On 24/05/2021 01:05, Damien Le Moal wrote: 
>>> +	if (fs_info && fs_info->max_zone_append_size)
>>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>>> +				       PAGE_SIZE);
>>
>> Why is the alignment needed? Are the max zone append values expected to
>> be so random? Also, it's using a memory-related value for something that's
>> more hw-related, or at least extent size (which ends up on disk).

I did the ALIGN_DOWN() call because we want to have complete pages added.

> It is similar to max_hw_sectors: the hardware decides what the value is. So we
> cannot assume anything about what max_zone_append_size is.
> 
> I think that Johannes' patch here limits the extent size to the HW value to avoid
> having to split the extent later on. That is efficient, but indeed a bit of a
> layering violation here.

Damien just brought up a good idea: what about a function to look up the max extent
size depending on the block group? For regular btrfs it'll for now just return
BTRFS_MAX_EXTENT_SIZE; for zoned btrfs it'll return
ALIGN_DOWN(fs_info->max_zone_append_size, PAGE_SIZE). It also gives us some
headroom for future improvements in this area.
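
A minimal sketch of such a helper, with a hypothetical name and signature (this
reflects only the proposal above, not upstream code):

	/*
	 * Hypothetical helper; the name and the block group parameter are
	 * assumptions. bg is unused for now but leaves room for
	 * per-block-group limits later.
	 */
	static u64 btrfs_max_extent_size(struct btrfs_fs_info *fs_info,
					 struct btrfs_block_group *bg)
	{
		/* Zoned data is written via REQ_OP_ZONE_APPEND, so cap
		 * extents at the page-aligned zone-append limit. */
		if (btrfs_is_zoned(fs_info) && fs_info->max_zone_append_size)
			return ALIGN_DOWN(fs_info->max_zone_append_size,
					  PAGE_SIZE);

		return BTRFS_MAX_EXTENT_SIZE;
	}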
David Sterba May 31, 2021, 6:58 p.m. UTC | #6
On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  				    u64 *end)
>  {
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>  	u64 delalloc_start;
>  	u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  	int ret;
>  	int loops = 0;
>  
> +	if (fs_info && fs_info->max_zone_append_size)

Do you really need to check for a valid fs_info? It's derived from an
inode so it must be valid or something is seriously wrong.

> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +				       PAGE_SIZE);

Right now the page alignment sounds OK because the delalloc code works
on page granularity. There's the implicit assumption that data blocks
are page-sized, but the whole delalloc engine works on pages, so no
reason to use anything else.
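
To make the alignment concrete, a worked example with an illustrative limit of
130048 bytes (127 KiB; the actual value depends on the device):

	u64 limit = 130048;	/* illustrative, not taken from the report */

	u64 max_4k  = ALIGN_DOWN(limit, SZ_4K);	 /* 126976: 31 whole 4 KiB pages */
	u64 max_64k = ALIGN_DOWN(limit, SZ_64K); /* 65536: one whole 64 KiB page */

Incidentally, 126976 is also the extent length visible in the dump-tree output
in the commit message.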
David Sterba May 31, 2021, 7:02 p.m. UTC | #7
On Tue, May 25, 2021 at 09:40:22AM +0000, Johannes Thumshirn wrote:
> On 24/05/2021 01:05, Damien Le Moal wrote: 
> >>> +	if (fs_info && fs_info->max_zone_append_size)
> >>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> >>> +				       PAGE_SIZE);
> >>
> >> Why is the alignment needed? Are the max zone append values expected to
> >> be so random? Also, it's using a memory-related value for something that's
> >> more hw-related, or at least extent size (which ends up on disk).
> 
> I did the ALIGN_DOWN() call because we want to have complete pages added.
> 
> > It is similar to max_hw_sectors: the hardware decides what the value is. So we
> > cannot assume anything about what max_zone_append_size is.
> > 
> > I think that Johannes' patch here limits the extent size to the HW value to avoid
> > having to split the extent later on. That is efficient, but indeed a bit of a
> > layering violation here.
> 
> Damien just brought up a good idea: what about a function to look up the max extent
> size depending on the block group? For regular btrfs it'll for now just return
> BTRFS_MAX_EXTENT_SIZE; for zoned btrfs it'll return
> ALIGN_DOWN(fs_info->max_zone_append_size, PAGE_SIZE). It also gives us some
> headroom for future improvements in this area.

Hm, right, that sounds safer. I've grepped for BTRFS_MAX_EXTENT_SIZE and
it's used in many places, so it's not just the one you fixed. If the
maximum extent size is really limited by max_zone_append_size, it needs to be
applied consistently everywhere, thus needing a helper.
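
With such a helper, the hunk in this patch would reduce to something like the
following sketch (reusing the hypothetical btrfs_max_extent_size() from above;
NULL is passed because no block group is known at this point in
find_lock_delalloc_range()):

	u64 max_bytes = btrfs_max_extent_size(fs_info, NULL);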
Johannes Thumshirn June 1, 2021, 7:44 a.m. UTC | #8
On 31/05/2021 21:01, David Sterba wrote:
> On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  				    u64 *end)
>>  {
>>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>>  	u64 delalloc_start;
>>  	u64 delalloc_end;
>> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  	int ret;
>>  	int loops = 0;
>>  
>> +	if (fs_info && fs_info->max_zone_append_size)
> 
> Do you really need to check for a valid fs_info? It's derived from an
> inode so it must be valid or something is seriously wrong.

I thought it was because some selftest tripped over a NULL pointer, but it looks 
very much like cargo cult. I'll recheck.

> 
>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>> +				       PAGE_SIZE);
> 
> > Right now the page alignment sounds OK because the delalloc code works
> > on page granularity. There's the implicit assumption that data blocks
> > are page-sized, but the whole delalloc engine works on pages, so no
> > reason to use anything else.
>
David Sterba June 1, 2021, 6:58 p.m. UTC | #9
On Tue, Jun 01, 2021 at 07:44:29AM +0000, Johannes Thumshirn wrote:
> On 31/05/2021 21:01, David Sterba wrote:
> > On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> >> --- a/fs/btrfs/extent_io.c
> >> +++ b/fs/btrfs/extent_io.c
> >> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
> >>  				    u64 *end)
> >>  {
> >>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> >> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
> >>  	u64 delalloc_start;
> >>  	u64 delalloc_end;
> >> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
> >>  	int ret;
> >>  	int loops = 0;
> >>  
> >> +	if (fs_info && fs_info->max_zone_append_size)
> > 
> > Do you really need to check for a valid fs_info? It's derived from an
> > inode so it must be valid or something is seriously wrong.
> 
> I thought it was because some selftest tripped over a NULL pointer, but it looks 
> very much like cargo cult. I'll recheck.

Ah right, the self-tests have some exceptions regarding fs_info, though IIRC
there's always one, but maybe not fully set up, or with an artificial
nodesize etc., to test the structures that are just in memory.

I'd rather not pull in selftest-specific code, so if it really crashed in the
tests then let's please fix the tests.
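
The extent-io self-tests already construct a dummy fs_info via the helpers in
fs/btrfs/tests/, so fixing the tests looks feasible. A rough sketch, with error
handling and the wiring of fs_info into the test superblock omitted:

	/* Give the self-test a real (dummy) fs_info so that
	 * btrfs_sb(inode->i_sb) never returns NULL. */
	struct btrfs_fs_info *fs_info;

	fs_info = btrfs_alloc_dummy_fs_info(nodesize, sectorsize);
	if (!fs_info)
		return -ENOMEM;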

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 78d3f2ec90e0..e823b2c74af5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 				    u64 *end)
 {
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
 	u64 delalloc_start;
 	u64 delalloc_end;
@@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 	int ret;
 	int loops = 0;
 
+	if (fs_info && fs_info->max_zone_append_size)
+		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
+				       PAGE_SIZE);
 again:
 	/* step one, find a bunch of delalloc bytes starting at start */
 	delalloc_start = *start;