Message ID | 65f1b716324a06c5cad99f2737a8669899d4569f.1621588229.git.johannes.thumshirn@wdc.com (mailing list archive) |
---|---
State | New, archived |
Series | btrfs: zoned: limit ordered extent to zoned append size
On 2021/05/21 18:11, Johannes Thumshirn wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
>
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent though, because
> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
>
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
>
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>
> On a zoned filesystem, limit the size of an ordered extent to the maximum
> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
>
> Note: This patch breaks fstests btrfs/079, as it increases the number of
> on-disk extents from 80 to 83 per 10M write.

Can this test case be fixed by calculating the number of extents that will be
written using sysfs zone_append_max_bytes ? That would avoid hard-coding a value
for the zoned case...

>
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 78d3f2ec90e0..e823b2c74af5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>                                      u64 *end)
>  {
>          struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>          u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>          u64 delalloc_start;
>          u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>          int ret;
>          int loops = 0;
>
> +        if (fs_info && fs_info->max_zone_append_size)
> +                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +                                       PAGE_SIZE);
>  again:
>          /* step one, find a bunch of delalloc bytes starting at start */
>          delalloc_start = *start;
>
On 21/05/2021 11:46, Damien Le Moal wrote:
>> Note: This patch breaks fstests btrfs/079, as it increases the number of
>> on-disk extents from 80 to 83 per 10M write.
> Can this test case be fixed by calculating the number of extents that will be
> written using sysfs zone_append_max_bytes ? That would avoid hard-coding a value
> for the zoned case...

Probably yes, but I'd like to hear from others how important they see the
hard-coded value.
On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
>
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent though, because
> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
>
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
>
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>
> On a zoned filesystem, limit the size of an ordered extent to the maximum
> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
>
> Note: This patch breaks fstests btrfs/079, as it increases the number of
> on-disk extents from 80 to 83 per 10M write.
>
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 78d3f2ec90e0..e823b2c74af5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>                                      u64 *end)
>  {
>          struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>          u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>          u64 delalloc_start;
>          u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>          int ret;
>          int loops = 0;
>
> +        if (fs_info && fs_info->max_zone_append_size)
> +                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +                                       PAGE_SIZE);

Why is the alignment needed? Are the max zone append values expected to
be so random? Also it's using memory-related value for something that's
more hw related, or at least extent size (which ends up on disk).

>  again:
>          /* step one, find a bunch of delalloc bytes starting at start */
>          delalloc_start = *start;
> --
> 2.31.1
On 2021/05/22 1:39, David Sterba wrote:
> On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
>> Damien reported a test failure with btrfs/209. The test itself ran fine,
>> but the fsck run afterwards reported a corrupted filesystem.
>>
>> The filesystem corruption happens because we're splitting an extent and
>> then writing the extent twice. We have to split the extent though, because
>> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
>>
>> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
>> start address but different lengths.
>>
>> $ btrfs inspect dump-tree /dev/nullb1 -t extent
>> ...
>>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>>            refs 1 gen 7 flags DATA
>>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>>            refs 1 gen 7 flags DATA
>>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>
>> On a zoned filesystem, limit the size of an ordered extent to the maximum
>> size that can be issued as a single REQ_OP_ZONE_APPEND operation.
>>
>> Note: This patch breaks fstests btrfs/079, as it increases the number of
>> on-disk extents from 80 to 83 per 10M write.
>>
>> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>> fs/btrfs/extent_io.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 78d3f2ec90e0..e823b2c74af5 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>                                      u64 *end)
>>  {
>>          struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>> +        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>          u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>>          u64 delalloc_start;
>>          u64 delalloc_end;
>> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>          int ret;
>>          int loops = 0;
>>
>> +        if (fs_info && fs_info->max_zone_append_size)
>> +                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>> +                                       PAGE_SIZE);
>
> Why is the alignment needed? Are the max zone append values expected to
> be so random? Also it's using memory-related value for something that's
> more hw related, or at least extent size (which ends up on disk).

It is similar to max_hw_sectors: the hardware decides what the value is. So we
cannot assume anything about what max_zone_append_size is.

I think that Johannes' patch here limits the extent size to the HW value to avoid
having to split the extent later on. That is efficient but indeed is a bit of a
layering violation here.

>
>>  again:
>>          /* step one, find a bunch of delalloc bytes starting at start */
>>          delalloc_start = *start;
>> --
>> 2.31.1
>
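[Editor's illustration] To see why the rounding matters: the device-reported
zone append limit is not necessarily a multiple of the page size, so without
ALIGN_DOWN() the delalloc range could end in the middle of a page. A minimal
userspace sketch; the 255-sector limit is a made-up example value, not taken
from the thread:

    #include <stdint.h>
    #include <stdio.h>

    /* Userspace stand-in for the kernel's ALIGN_DOWN(); rounds x down to a
     * power-of-two boundary a. */
    #define ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))

    int main(void)
    {
            uint64_t page_size = 4096;
            /* Hypothetical device limit of 255 sectors (255 * 512 bytes). */
            uint64_t max_zone_append_size = 255 * 512;

            /* Only whole pages may be added to the delalloc range. */
            printf("%llu\n", (unsigned long long)
                   ALIGN_DOWN(max_zone_append_size, page_size));
            /* Prints 126976, i.e. 124 KiB = 31 full pages. */
            return 0;
    }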
On 24/05/2021 01:05, Damien Le Moal wrote:
>>> +        if (fs_info && fs_info->max_zone_append_size)
>>> +                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>>> +                                       PAGE_SIZE);
>>
>> Why is the alignment needed? Are the max zone append values expected to
>> be so random? Also it's using memory-related value for something that's
>> more hw related, or at least extent size (which ends up on disk).

I did the ALIGN_DOWN() call because we want to have complete pages added.

> It is similar to max_hw_sectors: the hardware decides what the value is. So we
> cannot assume anything about what max_zone_append_size is.
>
> I think that Johannes' patch here limits the extent size to the HW value to avoid
> having to split the extent later on. That is efficient but indeed is a bit of a
> layering violation here.

Damien just brought up a good idea: what about a function to look up the max extent
size depending on the block group. For regular btrfs it'll for now just return
BTRFS_MAX_EXTENT_SIZE, for zoned btrfs it'll return
ALIGN_DOWN(fs_info->max_zone_append_size, PAGE_SIZE) and it also gives us some
headroom for future improvements in this area.
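[Editor's illustration] A minimal sketch of the helper idea above, assuming a
filesystem-wide lookup for now rather than a per-block-group one; the name
btrfs_max_extent_size() and its placement are assumptions, not code from the
thread:

    /*
     * Sketch only: return the largest extent we should create.  Regular
     * btrfs keeps BTRFS_MAX_EXTENT_SIZE; zoned btrfs is capped by the
     * zone append limit, rounded down to whole pages.
     */
    static inline u64 btrfs_max_extent_size(const struct btrfs_fs_info *fs_info)
    {
            if (fs_info->max_zone_append_size)
                    return ALIGN_DOWN(fs_info->max_zone_append_size, PAGE_SIZE);
            return BTRFS_MAX_EXTENT_SIZE;
    }

A per-block-group variant could later take the block group and decide based on
its allocation type, which is the headroom for future improvements Johannes
mentions.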
On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>                                      u64 *end)
>  {
>          struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>          u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>          u64 delalloc_start;
>          u64 delalloc_end;
> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>          int ret;
>          int loops = 0;
>
> +        if (fs_info && fs_info->max_zone_append_size)

Do you really need to check for a valid fs_info? It's derived from an
inode so it must be valid or something is seriously wrong.

> +                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> +                                       PAGE_SIZE);

Right now the page alignment sounds ok because the delalloc code works
on page granularity. There's the implicit assumption that data blocks
are page-sized, but the whole delalloc engine works on pages so no
reason to use anything else.
On Tue, May 25, 2021 at 09:40:22AM +0000, Johannes Thumshirn wrote:
> On 24/05/2021 01:05, Damien Le Moal wrote:
> >>> +        if (fs_info && fs_info->max_zone_append_size)
> >>> +                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
> >>> +                                       PAGE_SIZE);
> >>
> >> Why is the alignment needed? Are the max zone append values expected to
> >> be so random? Also it's using memory-related value for something that's
> >> more hw related, or at least extent size (which ends up on disk).
>
> I did the ALIGN_DOWN() call because we want to have complete pages added.
>
> > It is similar to max_hw_sectors: the hardware decides what the value is. So we
> > cannot assume anything about what max_zone_append_size is.
> >
> > I think that Johannes' patch here limits the extent size to the HW value to avoid
> > having to split the extent later on. That is efficient but indeed is a bit of a
> > layering violation here.
>
> Damien just brought up a good idea: what about a function to look up the max extent
> size depending on the block group. For regular btrfs it'll for now just return
> BTRFS_MAX_EXTENT_SIZE, for zoned btrfs it'll return
> ALIGN_DOWN(fs_info->max_zone_append_size, PAGE_SIZE) and it also gives us some
> headroom for future improvements in this area.

Hm, right that sounds safer. I've grepped for BTRFS_MAX_EXTENT_SIZE and it's
used in many places so it's not just the one you fixed. If the maximum extent
size is really limited by max_zone_append it needs to be used consistently
everywhere, thus needing a helper.
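[Editor's illustration] With such a helper, the hunk in
find_lock_delalloc_range() could shrink to a single lookup instead of the
open-coded zoned special case; again only a sketch of the direction discussed
here, not code from the thread:

            struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
            /* Covers both the regular and the zoned case. */
            u64 max_bytes = btrfs_max_extent_size(fs_info);

The other BTRFS_MAX_EXTENT_SIZE users David grepped for could then be
converted to the same call, which is what makes the limit consistent
everywhere.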
On 31/05/2021 21:01, David Sterba wrote:
> On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>                                      u64 *end)
>>  {
>>          struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>> +        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>>          u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
>>          u64 delalloc_start;
>>          u64 delalloc_end;
>> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>>          int ret;
>>          int loops = 0;
>>
>> +        if (fs_info && fs_info->max_zone_append_size)
>
> Do you really need to check for a valid fs_info? It's derived from an
> inode so it must be valid or something is seriously wrong.

I thought it was because some selftest tripped over a NULL pointer, but it looks
very much like cargo cult. I'll recheck.

>
>> +                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
>> +                                       PAGE_SIZE);
>
> Right now the page alignment sounds ok because the delalloc code works
> on page granularity. There's the implicit assumption that data blocks
> are page-sized, but the whole delalloc engine works on pages so no
> reason to use anything else.
>
On Tue, Jun 01, 2021 at 07:44:29AM +0000, Johannes Thumshirn wrote:
> On 31/05/2021 21:01, David Sterba wrote:
> > On Fri, May 21, 2021 at 06:11:04PM +0900, Johannes Thumshirn wrote:
> >> --- a/fs/btrfs/extent_io.c
> >> +++ b/fs/btrfs/extent_io.c
> >> @@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
> >>                                      u64 *end)
> >>  {
> >>          struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> >> +        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >>          u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
> >>          u64 delalloc_start;
> >>          u64 delalloc_end;
> >> @@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
> >>          int ret;
> >>          int loops = 0;
> >>
> >> +        if (fs_info && fs_info->max_zone_append_size)
> >
> > Do you really need to check for a valid fs_info? It's derived from an
> > inode so it must be valid or something is seriously wrong.
>
> I thought it was because some selftest tripped over a NULL pointer, but it looks
> very much like cargo cult. I'll recheck.

Ah right, the self tests have some exceptions regarding fs_info, though IIRC
there's always some, just maybe not fully set up or with some artificial
nodesize etc. to test the structures that are just in memory. I'd rather not
pull selftest-specific code, so if it really crashed in the tests then let's
please fix the tests.
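[Editor's illustration] If the NULL check really is cargo cult and the self
tests are fixed instead, the guard in the patch could presumably be reduced to
the following; a sketch assuming fs_info is always valid at this point:

            if (fs_info->max_zone_append_size)
                    max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
                                           PAGE_SIZE);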
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 78d3f2ec90e0..e823b2c74af5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1860,6 +1860,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
                                     u64 *end)
 {
         struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+        struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
         u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
         u64 delalloc_start;
         u64 delalloc_end;
@@ -1868,6 +1869,9 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
         int ret;
         int loops = 0;

+        if (fs_info && fs_info->max_zone_append_size)
+                max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size,
+                                       PAGE_SIZE);
 again:
         /* step one, find a bunch of delalloc bytes starting at start */
         delalloc_start = *start;
Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck run afterwards reported a corrupted filesystem.

The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent though, because
we're creating too large extents for a REQ_OP_ZONE_APPEND operation.

When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.

$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
   item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
   item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1

On a zoned filesystem, limit the size of an ordered extent to the maximum
size that can be issued as a single REQ_OP_ZONE_APPEND operation.

Note: This patch breaks fstests btrfs/079, as it increases the number of
on-disk extents from 80 to 83 per 10M write.

Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent_io.c | 4 ++++
 1 file changed, 4 insertions(+)