
btrfs: zoned: limit ordered extent to zoned append size

Message ID: 65f1b716324a06c5cad99f2737a8669899d4569f.1621588229.git.johannes.thumshirn@wdc.com
State: New, archived
Series: btrfs: zoned: limit ordered extent to zoned append size

Commit Message

Johannes Thumshirn May 21, 2021, 9:11 a.m. UTC
Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck run afterwards reported a corrupted filesystem.

The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent, though, because
we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.

When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.

$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
   item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
   item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1

On a zoned filesystem, limit the size of an ordered extent to the maximum
size that can be issued as a single REQ_OP_ZONE_APPEND operation.

Note: This patch breaks fstests btrfs/079, as it increases the number of
on-disk extents from 80 to 83 per 10M write.

Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent_io.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Damien Le Moal May 21, 2021, 9:46 a.m. UTC | #1
On 2021/05/21 18:11, Johannes Thumshirn wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
> 
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent, though, because
> we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.
> 
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
> 
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
> 
> On a zoned filesystem, limit the size of an ordered extent to the maximum
> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
> 
> Note: This patch breaks fstests btrfs/079, as it increases the number of
> on-disk extents from 80 to 83 per 10M write.

Can this test case be fixed by calculating the number of extents that will be
written using the sysfs zone_append_max_bytes attribute? That would avoid
hard-coding a value for the zoned case...

> 
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 78d3f2ec90e0..e823b2c74af5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  				    u64 *end)
>  {
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>  	u64 delalloc_start;
>  	u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  	int ret;
>  	int loops = 0;
>  
> +	if (fs_info && fs_info->max_zone_append_size)
> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +				       PAGE_SIZE);
>  again:
>  	/* step one, find a bunch of delalloc bytes starting at start */
>  	delalloc_start = *start;
>
Johannes Thumshirn May 21, 2021, 9:52 a.m. UTC | #2
On 21/05/2021 11:46, Damien Le Moal wrote:
>> Note: This patch breaks fstests btrfs/079, as it increases the number of
>> on-disk extents from 80 to 83 per 10M write.
> Can this test case be fixed by calculating the number of extents that will be
> written using the sysfs zone_append_max_bytes attribute? That would avoid
> hard-coding a value for the zoned case...

Probably yes, but I'd like to hear from others how important they consider the
hard-coded value.
David Sterba May 21, 2021, 4:37 p.m. UTC | #3
On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
> 
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent, though, because
> we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.
> 
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
> 
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
> 
> On a zoned filesystem, limit the size of an ordered extent to the maximum
> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
> 
> Note: This patch breaks fstests btrfs/079, as it increases the number of
> on-disk extents from 80 to 83 per 10M write.
> 
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 78d3f2ec90e0..e823b2c74af5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  				    u64 *end)
>  {
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>  	u64 delalloc_start;
>  	u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  	int ret;
>  	int loops = 0;
>  
> +	if (fs_info && fs_info->max_zone_append_size)
> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +				       PAGE_SIZE);

Why is the alignment needed? Are the max zone append values expected to
be so random? Also, it's using a memory-related value for something that's
more hw-related, or at least extent size (which ends up on disk).

>  again:
>  	/* step one, find a bunch of delalloc bytes starting at start */
>  	delalloc_start = *start;
> -- 
> 2.31.1
Damien Le Moal May 23, 2021, 11:05 p.m. UTC | #4
On 2021/05/22 1:39, David Sterba wrote:
> On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
>> Damien reported a test failure with btrfs/209. The test itself ran fine,
>> but the fsck run afterwards reported a corrupted filesystem.
>>
>> The filesystem corruption happens because we're splitting an extent and
>> then writing the extent twice. We have to split the extent, though, because
>> we're creating extents that are too large for a REQ_OP_ZONE_APPEND operation.
>>
>> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
>> start address but different lengths.
>>
>> $ btrfs inspect dump-tree /dev/nullb1 -t extent
>> ...
>>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>>            refs 1 gen 7 flags DATA
>>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>>            refs 1 gen 7 flags DATA
>>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>
>> On a zoned filesystem, limit the size of an ordered extent to the maximum
>> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
>>
>> Note: This patch breaks fstests btrfs/079, as it increases the number of
>> on-disk extents from 80 to 83 per 10M write.
>>
>> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>  fs/btrfs/extent_io.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 78d3f2ec90e0..e823b2c74af5 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  				    u64 *end)
>>  {
>>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>>  	u64 delalloc_start;
>>  	u64 delalloc_end;
>> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  	int ret;
>>  	int loops = 0;
>>  
>> +	if (fs_info && fs_info->max_zone_append_size)
>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>> +				       PAGE_SIZE);
> 
> Why is the alignment needed? Are the max zone append values expected to
> be so random? Also, it's using a memory-related value for something that's
> more hw-related, or at least extent size (which ends up on disk).

It is similar to max_hw_sectors: the hardware decides what the value is. So we
cannot assume anything about what max_zone_append_size is.

I think that Johannes' patch here limits the extent size to the HW value to avoid
having to split the extent later on. That is efficient, but indeed a bit of a
layering violation here.

> 
>>  again:
>>  	/* step one, find a bunch of delalloc bytes starting at start */
>>  	delalloc_start = *start;
>> -- 
>> 2.31.1
>
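
For context, the per-device limit discussed here is taken straight from the block
layer's queue limit when btrfs loads the zone information. A simplified sketch
(based on the 5.12-era fs/btrfs/zoned.c, not a verbatim excerpt; the
filesystem-wide fs_info->max_zone_append_size is the minimum of this value
across all devices):

	struct request_queue *queue = bdev_get_queue(device->bdev);

	/* Whatever limit the hardware/driver advertises, in bytes. */
	zone_info->max_zone_append_size =
		(u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT;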
Johannes Thumshirn May 25, 2021, 9:40 a.m. UTC | #5
On 24/05/2021 01:05, Damien Le Moal wrote: 
>>> +	if (fs_info && fs_info->max_zone_append_size)
>>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>>> +				       PAGE_SIZE);
>>
>> Why is the alignment needed? Are the max zone append values expected to
>> be so random? Also, it's using a memory-related value for something that's
>> more hw-related, or at least extent size (which ends up on disk).

I did the ALIGN_DOWN() call because we want to have complete pages added.

> It is similar to max_hw_sectors: the hardware decides what the value is. So we
> cannot assume anything about what max_zone_append_size is.
> 
> I think that Johannes' patch here limits the extent size to the HW value to avoid
> having to split the extent later on. That is efficient, but indeed a bit of a
> layering violation here.

Damien just brought up a good idea: what about a function to look up the max extent
size depending on the block group? For regular btrfs it'll for now just return
BTRFS_MAX_EXTENT_SIZE; for zoned btrfs it'll return
ALIGN_DOWN(fs_info->max_zone_append_size, PAGE_SIZE). It also gives us some
headroom for future improvements in this area.
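
A minimal sketch of such a helper, with a hypothetical name and signature (this
reflects only the proposal above, not upstream code):

	/*
	 * Hypothetical helper; the name and the block group parameter are
	 * assumptions. bg is unused for now but leaves room for
	 * per-block-group limits later.
	 */
	static u64 btrfs_max_extent_size(struct btrfs_fs_info *fs_info,
					 struct btrfs_block_group *bg)
	{
		/* Zoned data is written via REQ_OP_ZONE_APPEND, so cap
		 * extents at the page-aligned zone-append limit. */
		if (btrfs_is_zoned(fs_info) && fs_info->max_zone_append_size)
			return ALIGN_DOWN(fs_info->max_zone_append_size,
					  PAGE_SIZE);

		return BTRFS_MAX_EXTENT_SIZE;
	}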
David Sterba May 31, 2021, 6:58 p.m. UTC | #6
On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  				    u64 *end)
>  {
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>  	u64 delalloc_start;
>  	u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  	int ret;
>  	int loops = 0;
>  
> +	if (fs_info && fs_info->max_zone_append_size)

Do you really need to check for a valid fs_info? It's derived from an
inode so it must be valid or something is seriously wrong.

> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +				       PAGE_SIZE);

Right now the page alignment sounds OK because the delalloc code works
on page granularity. There's the implicit assumption that data blocks
are page-sized, but the whole delalloc engine works on pages, so no
reason to use anything else.
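
To make the alignment concrete, a worked example with an illustrative limit of
130048 bytes (127 KiB; the actual value depends on the device):

	u64 limit = 130048;	/* illustrative, not taken from the report */

	u64 max_4k  = ALIGN_DOWN(limit, SZ_4K);	 /* 126976: 31 whole 4 KiB pages */
	u64 max_64k = ALIGN_DOWN(limit, SZ_64K); /* 65536: one whole 64 KiB page */

Incidentally, 126976 is also the extent length visible in the dump-tree output
in the commit message.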
David Sterba May 31, 2021, 7:02 p.m. UTC | #7
On Tue, May 25, 2021 at 09:40:22AM +0000, Johannes Thumshirn wrote:
> On 24/05/2021 01:05, Damien Le Moal wrote: 
> >>> +	if (fs_info && fs_info->max_zone_append_size)
> >>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> >>> +				       PAGE_SIZE);
> >>
> >> Why is the alignment needed? Are the max zone append values expected to
> >> be so random? Also, it's using a memory-related value for something that's
> >> more hw-related, or at least extent size (which ends up on disk).
> 
> I did the ALIGN_DOWN() call because we want to have complete pages added.
> 
> > It is similar to max_hw_sectors: the hardware decides what the value is. So we
> > cannot assume anything about what max_zone_append_size is.
> > 
> > I think that Johannes' patch here limits the extent size to the HW value to avoid
> > having to split the extent later on. That is efficient, but indeed a bit of a
> > layering violation here.
> 
> Damien just brought up a good idea: what about a function to look up the max extent
> size depending on the block group? For regular btrfs it'll for now just return
> BTRFS_MAX_EXTENT_SIZE; for zoned btrfs it'll return
> ALIGN_DOWN(fs_info->max_zone_append_size, PAGE_SIZE). It also gives us some
> headroom for future improvements in this area.

Hm, right, that sounds safer. I've grepped for BTRFS_MAX_EXTENT_SIZE and
it's used in many places, so it's not just the one you fixed. If the
maximum extent size is really limited by max_zone_append_size, it needs to be
applied consistently everywhere, thus needing a helper.
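
With such a helper, the hunk in this patch would reduce to something like the
following sketch (reusing the hypothetical btrfs_max_extent_size() from above;
NULL is passed because no block group is known at this point in
find_lock_delalloc_range()):

	u64 max_bytes = btrfs_max_extent_size(fs_info, NULL);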
Johannes Thumshirn June 1, 2021, 7:44 a.m. UTC | #8
On 31/05/2021 21:01, David Sterba wrote:
> On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  				    u64 *end)
>>  {
>>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>>  	u64 delalloc_start;
>>  	u64 delalloc_end;
>> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>  	int ret;
>>  	int loops = 0;
>>  
>> +	if (fs_info && fs_info->max_zone_append_size)
> 
> Do you really need to check for a valid fs_info? It's derived from an
> inode so it must be valid or something is seriously wrong.

I thought it was because some selftest tripped over a NULL pointer, but it looks 
very much like cargo cult. I'll recheck.

> 
>> +		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>> +				       PAGE_SIZE);
> 
> > Right now the page alignment sounds OK because the delalloc code works
> > on page granularity. There's the implicit assumption that data blocks
> > are page-sized, but the whole delalloc engine works on pages, so no
> > reason to use anything else.
>
David Sterba June 1, 2021, 6:58 p.m. UTC | #9
On Tue, Jun 01, 2021 at 07:44:29AM +0000, Johannes Thumshirn wrote:
> On 31/05/2021 21:01, David Sterba wrote:
> > On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> >> --- a/fs/btrfs/extent_io.c
> >> +++ b/fs/btrfs/extent_io.c
> >> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
> >>  				    u64 *end)
> >>  {
> >>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> >> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >>  	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
> >>  	u64 delalloc_start;
> >>  	u64 delalloc_end;
> >> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
> >>  	int ret;
> >>  	int loops = 0;
> >>  
> >> +	if (fs_info && fs_info->max_zone_append_size)
> > 
> > Do you really need to check for a valid fs_info? It's derived from an
> > inode so it must be valid or something is seriously wrong.
> 
> I thought it was because some selftest tripped over a NULL pointer, but it looks 
> very much like cargo cult. I'll recheck.

Ah right, the self-tests have some exceptions regarding fs_info, though IIRC
there's always one, but maybe not fully set up, or with an artificial
nodesize etc., to test the structures that are just in memory.

I'd rather not pull in selftest-specific code, so if it really crashed in the
tests then let's please fix the tests.
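
The extent-io self-tests already construct a dummy fs_info via the helpers in
fs/btrfs/tests/, so fixing the tests looks feasible. A rough sketch, with error
handling and the wiring of fs_info into the test superblock omitted:

	/* Give the self-test a real (dummy) fs_info so that
	 * btrfs_sb(inode->i_sb) never returns NULL. */
	struct btrfs_fs_info *fs_info;

	fs_info = btrfs_alloc_dummy_fs_info(nodesize, sectorsize);
	if (!fs_info)
		return -ENOMEM;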

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 78d3f2ec90e0..e823b2c74af5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 				    u64 *end)
 {
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
 	u64 delalloc_start;
 	u64 delalloc_end;
@@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 	int ret;
 	int loops = 0;
 
+	if (fs_info && fs_info->max_zone_append_size)
+		max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
+				       PAGE_SIZE);
 again:
 	/* step one, find a bunch of delalloc bytes starting at start */
 	delalloc_start = *start;