Message ID | a858fb2ff980db27b3638e92f7d2d7a416b8e81e.1628776260.git.johannes.thumshirn@wdc.com
---|---
State | New, archived
Series | btrfs: zoned: exclude relocation and page writeback
On Thu, Aug 12, 2021 at 10:53:49PM +0900, Johannes Thumshirn wrote:
> Relocation in a zoned filesystem can fail with a transaction abort with
> error -22 (EINVAL). This happens because the relocation code assumes that
> the extents we relocated the data to have the same size as the source
> extents had, and ensures this by preallocating the extents.
>
> But in a zoned filesystem we can't preallocate the extents as this would
> break the sequential write required rule. Therefore it can happen that the
> writeback process kicks in while we're still adding pages to a
> delalloc range and starts writing out dirty pages.
>
> This then creates destination extents that are smaller than the source
> extents, triggering the following safety check in get_new_location():
>
> 1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
> 1035                 ret = -EINVAL;
> 1036                 goto out;
> 1037         }
>
> One possible solution to address this problem is to mutually exclude page
> writeback and adding pages to the relocation inode. This ensures that
> we're not submitting an extent before all needed pages have been added to
> it.
>
> Introduce a new lock in the btrfs_inode which is only taken *IFF* the
> inode is a data relocation inode on a zoned filesystem, to mutually exclude
> relocation's construction of extents and page writeback.
>
> Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem")
> Reported-by: David Sterba <dsterba@suse.com>
> Cc: Filipe Manana <fdmanana@suse.com>
> Cc: Naohiro Aota <naohiro.aota@wdc.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/btrfs_inode.h |  3 +++
>  fs/btrfs/extent_io.c   |  4 ++++
>  fs/btrfs/inode.c       |  1 +
>  fs/btrfs/relocation.c  | 10 +++++++---
>  fs/btrfs/zoned.h       | 27 +++++++++++++++++++++++++++
>  5 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 76ee1452c57b..954e772f18e8 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -231,6 +231,9 @@ struct btrfs_inode {
>
>  	struct rw_semaphore i_mmap_lock;
>  	struct inode vfs_inode;
> +
> +	/* Protects relocation from page writeback on a zoned FS */
> +	struct mutex relocation_lock;

That's a lot of new bytes for an inode, and just for zoned mode. Is
there another way to synchronize that? Like some inode state bit and
then waiting on it, using the generic wait queues and an API like
wait_var_event?
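To make the suggested state-bit alternative concrete, a minimal sketch could look like the code below. The BTRFS_INODE_RELOC_WRITEBACK runtime flag is made up for this illustration and is not part of the posted patch; only wait_var_event()/wake_up_var() (from linux/wait_bit.h) and the runtime_flags field of struct btrfs_inode are existing kernel interfaces.

```c
/*
 * Sketch only, in the style of fs/btrfs/zoned.h.
 * BTRFS_INODE_RELOC_WRITEBACK is a hypothetical runtime flag bit.
 */
#include <linux/bitops.h>
#include <linux/wait_bit.h>

static inline void btrfs_zoned_relocation_io_lock(struct btrfs_inode *inode)
{
	struct btrfs_root *root = inode->root;

	if (!btrfs_is_zoned(root->fs_info) ||
	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID)
		return;

	/* Sleep until we win the race for the (hypothetical) flag bit. */
	wait_var_event(&inode->runtime_flags,
		       !test_and_set_bit(BTRFS_INODE_RELOC_WRITEBACK,
					 &inode->runtime_flags));
}

static inline void btrfs_zoned_relocation_io_unlock(struct btrfs_inode *inode)
{
	struct btrfs_root *root = inode->root;

	if (!btrfs_is_zoned(root->fs_info) ||
	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID)
		return;

	clear_bit(BTRFS_INODE_RELOC_WRITEBACK, &inode->runtime_flags);
	/* Order the clear before waking waiters, as clear_and_wake_up_bit() does. */
	smp_mb__after_atomic();
	wake_up_var(&inode->runtime_flags);
}
```

This would avoid growing struct btrfs_inode by a full mutex, at the cost of open-coding the lock semantics on a flag bit.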
On 12/08/2021 16:28, David Sterba wrote:
> That's a lot of new bytes for an inode, and just for zoned mode. Is
> there another way to synchronize that? Like some inode state bit and
> then waiting on it, using the generic wait queues and an API like
> wait_var_event?

I can look into that, yes.

Filipe originally suggested using the inode_lock(), which would avoid the
new mutex as well. I went away from using the inode_lock() mainly for
documentation purposes, but I can call btrfs_inode_lock() from
btrfs_zoned_relocation_io_lock() as well.

I'll try implementing #1 and if that fails see if #2 is usable.

The longer alternative that Naohiro and Damien also suggested is avoiding
zone append on relocation and running a block group that is a target for
relocation at QD=1, but that is way more intrusive and not a good candidate
for stable IMHO.
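For reference, the btrfs_inode_lock() variant mentioned here would presumably keep the helper names from the patch but drop the new mutex, roughly along these lines. This is only a sketch of the idea, not posted code; btrfs_inode_lock()/btrfs_inode_unlock() are existing btrfs helpers wrapping the VFS inode lock.

```c
/* Sketch of the inode_lock()-based variant; not the code that was posted. */
static inline void btrfs_zoned_relocation_io_lock(struct btrfs_inode *inode)
{
	struct btrfs_root *root = inode->root;

	if (!btrfs_is_zoned(root->fs_info) ||
	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID)
		return;

	/* Take the exclusive VFS inode lock instead of a dedicated mutex. */
	btrfs_inode_lock(&inode->vfs_inode, 0);
}

static inline void btrfs_zoned_relocation_io_unlock(struct btrfs_inode *inode)
{
	struct btrfs_root *root = inode->root;

	if (!btrfs_is_zoned(root->fs_info) ||
	    root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID)
		return;

	btrfs_inode_unlock(&inode->vfs_inode, 0);
}
```

This keeps struct btrfs_inode at its current size, at the price of reusing a lock that already has other users, which is the documentation concern mentioned above.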
On Thu, Aug 12, 2021 at 02:40:59PM +0000, Johannes Thumshirn wrote:
> On 12/08/2021 16:28, David Sterba wrote:
> > That's a lot of new bytes for an inode, and just for zoned mode. Is
> > there another way to synchronize that? Like some inode state bit and
> > then waiting on it, using the generic wait queues and an API like
> > wait_var_event?
>
> I can look into that, yes.
>
> Filipe originally suggested using the inode_lock(), which would avoid the
> new mutex as well. I went away from using the inode_lock() mainly for
> documentation purposes, but I can call btrfs_inode_lock() from
> btrfs_zoned_relocation_io_lock() as well.
>
> I'll try implementing #1 and if that fails see if #2 is usable.
>
> The longer alternative that Naohiro and Damien also suggested is avoiding
> zone append on relocation and running a block group that is a target for
> relocation at QD=1, but that is way more intrusive and not a good candidate
> for stable IMHO.

The zoned mode still gets improved in each version, so we might not limit
ourselves by backportability of the fix. 5.12 is not updated anymore and
that's the lowest version we care about regarding zoned mode.

We could perhaps have the first "heavy" solution, like the mutex, backported
to 5.13/5.14, as that's what we'll probably use as a testing base for some
time. In the long term I'd rather have something that looks lighter, but I
haven't analyzed the bug nor the solutions very closely, so I can't say
which one is the best.
On 12/08/2021 16:53, David Sterba wrote:
> On Thu, Aug 12, 2021 at 02:40:59PM +0000, Johannes Thumshirn wrote:
>> On 12/08/2021 16:28, David Sterba wrote:
>>> That's a lot of new bytes for an inode, and just for zoned mode. Is
>>> there another way to synchronize that? Like some inode state bit and
>>> then waiting on it, using the generic wait queues and an API like
>>> wait_var_event?
>>
>> I can look into that, yes.
>>
>> Filipe originally suggested using the inode_lock(), which would avoid the
>> new mutex as well. I went away from using the inode_lock() mainly for
>> documentation purposes, but I can call btrfs_inode_lock() from
>> btrfs_zoned_relocation_io_lock() as well.
>>
>> I'll try implementing #1 and if that fails see if #2 is usable.
>>
>> The longer alternative that Naohiro and Damien also suggested is avoiding
>> zone append on relocation and running a block group that is a target for
>> relocation at QD=1, but that is way more intrusive and not a good candidate
>> for stable IMHO.
>
> The zoned mode still gets improved in each version, so we might not limit
> ourselves by backportability of the fix. 5.12 is not updated anymore and
> that's the lowest version we care about regarding zoned mode.
>
> We could perhaps have the first "heavy" solution, like the mutex, backported
> to 5.13/5.14, as that's what we'll probably use as a testing base for some
> time. In the long term I'd rather have something that looks lighter, but I
> haven't analyzed the bug nor the solutions very closely, so I can't say
> which one is the best.

JFYI: I did some testing with the inode lock and it looks good, but it does not
necessarily fix all possible problems, i.e. if an ordered extent is being
split due to whatever device limits (zone append, max sector size, etc.),
the assumptions we have in the relocation code aren't met again.

So the heavy-lifting solution of having a dedicated temporary relocation
block group (like the treelog block group we already have for zoned) and
using regular writes looks like the only solution taking care of all of
these problems.

That's what I'm working on right now.

Byte,
	Johannes
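To give a rough picture of that "dedicated relocation block group" direction: on zoned filesystems the tree-log already pins a single block group (fs_info->treelog_bg) during extent allocation, and relocation could gate allocations the same way, with relocation writes forced into one reserved group and submitted as regular (non zone-append) writes so that ordered extents are not split. The data_reloc_bg field, the relocation_bg_lock and the helper below are hypothetical illustrations of that idea, not code from this series.

```c
/*
 * Hypothetical sketch modeled on the zoned tree-log block group handling
 * in the extent allocator; data_reloc_bg and relocation_bg_lock do not
 * exist in this patch series.
 */
#include "ctree.h"
#include "block-group.h"

static bool btrfs_use_block_group_for_reloc(struct btrfs_fs_info *fs_info,
					    struct btrfs_block_group *bg,
					    bool for_data_reloc)
{
	bool ret = true;

	spin_lock(&fs_info->relocation_bg_lock);	/* hypothetical lock */
	if (for_data_reloc) {
		/* All relocation writes must land in the one reserved group. */
		if (fs_info->data_reloc_bg && fs_info->data_reloc_bg != bg->start)
			ret = false;
		else
			fs_info->data_reloc_bg = bg->start;
	} else {
		/* Ordinary data allocations must stay out of the reserved group. */
		if (fs_info->data_reloc_bg == bg->start)
			ret = false;
	}
	spin_unlock(&fs_info->relocation_bg_lock);

	return ret;
}
```

Keeping relocation confined to one block group written at QD=1 with regular writes is what keeps the destination extent sizes equal to the source extents, so the size check in get_new_location() holds without any per-inode locking.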
On Tue, Aug 17, 2021 at 02:21:51PM +0000, Johannes Thumshirn wrote:
> On 12/08/2021 16:53, David Sterba wrote:
> > On Thu, Aug 12, 2021 at 02:40:59PM +0000, Johannes Thumshirn wrote:
> >> On 12/08/2021 16:28, David Sterba wrote:
>
> I did some testing with the inode lock and it looks good, but it does not
> necessarily fix all possible problems, i.e. if an ordered extent is being
> split due to whatever device limits (zone append, max sector size, etc.),
> the assumptions we have in the relocation code aren't met again.
>
> So the heavy-lifting solution of having a dedicated temporary relocation
> block group (like the treelog block group we already have for zoned) and
> using regular writes looks like the only solution taking care of all of
> these problems.

So that means that the minimum number of zones increases again, right.
If the separate relocation zone fixes this and possibly other problems
then fine, but as you said, this is a heavyweight solution.

We will need a mechanism with a spare block group/zone for emergency
cases where we're running out of usable metadata space and need to
relocate, so this could build on the same framework. But for a first
implementation, reserving another block group sounds easier.
On 18/08/2021 11:39, David Sterba wrote:
> On Tue, Aug 17, 2021 at 02:21:51PM +0000, Johannes Thumshirn wrote:
>> On 12/08/2021 16:53, David Sterba wrote:
>>> On Thu, Aug 12, 2021 at 02:40:59PM +0000, Johannes Thumshirn wrote:
>>>> On 12/08/2021 16:28, David Sterba wrote:
>
>> I did some testing with the inode lock and it looks good, but it does not
>> necessarily fix all possible problems, i.e. if an ordered extent is being
>> split due to whatever device limits (zone append, max sector size, etc.),
>> the assumptions we have in the relocation code aren't met again.
>>
>> So the heavy-lifting solution of having a dedicated temporary relocation
>> block group (like the treelog block group we already have for zoned) and
>> using regular writes looks like the only solution taking care of all of
>> these problems.
>
> So that means that the minimum number of zones increases again, right.
> If the separate relocation zone fixes this and possibly other problems
> then fine, but as you said, this is a heavyweight solution.
>
> We will need a mechanism with a spare block group/zone for emergency
> cases where we're running out of usable metadata space and need to
> relocate, so this could build on the same framework. But for a first
> implementation, reserving another block group sounds easier.

Yes, but we need to set aside one block group/zone for relocation anyway
to make garbage collection workable.

I need to check in progs whether I have already included this spare zone in
the list of minimally required zones. If not, I'll post the progs update
together with the kernel side, as both need to go together anyway.
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 76ee1452c57b..954e772f18e8 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -231,6 +231,9 @@ struct btrfs_inode {
 
 	struct rw_semaphore i_mmap_lock;
 	struct inode vfs_inode;
+
+	/* Protects relocation from page writeback on a zoned FS */
+	struct mutex relocation_lock;
 };
 
 static inline u32 btrfs_inode_sectorsize(const struct btrfs_inode *inode)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 96de6e70d06c..59c79eb51612 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5005,7 +5005,9 @@ static int extent_write_cache_pages(struct address_space *mapping,
 				continue;
 			}
 
+			btrfs_zoned_relocation_io_lock(BTRFS_I(inode));
 			ret = __extent_writepage(page, wbc, epd);
+			btrfs_zoned_relocation_io_unlock(BTRFS_I(inode));
 			if (ret < 0) {
 				done = 1;
 				break;
@@ -5056,7 +5058,9 @@ int extent_write_full_page(struct page *page, struct writeback_control *wbc)
 		.sync_io = wbc->sync_mode == WB_SYNC_ALL,
 	};
 
+	btrfs_zoned_relocation_io_lock(BTRFS_I(page->mapping->host));
 	ret = __extent_writepage(page, wbc, &epd);
+	btrfs_zoned_relocation_io_unlock(BTRFS_I(page->mapping->host));
 	ASSERT(ret <= 0);
 	if (ret < 0) {
 		end_write_bio(&epd, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index afe9dcda860b..8e8e56f79a86 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9120,6 +9120,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	INIT_LIST_HEAD(&ei->delayed_iput);
 	RB_CLEAR_NODE(&ei->rb_node);
 	init_rwsem(&ei->i_mmap_lock);
+	mutex_init(&ei->relocation_lock);
 
 	return inode;
 }
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 914d403b4415..8630d45e0fca 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -25,6 +25,7 @@
 #include "backref.h"
 #include "misc.h"
 #include "subpage.h"
+#include "zoned.h"
 
 /*
  * Relocation overview
@@ -3069,8 +3070,6 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	unlock_page(page);
 	put_page(page);
 
-	balance_dirty_pages_ratelimited(inode->i_mapping);
-	btrfs_throttle(fs_info);
 	if (btrfs_should_cancel_balance(fs_info))
 		ret = -ECANCELED;
 	return ret;
@@ -3111,9 +3110,14 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		goto out;
 
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
+	btrfs_zoned_relocation_io_lock(BTRFS_I(inode));
 	for (index = (cluster->start - offset) >> PAGE_SHIFT;
-	     index <= last_index && !ret; index++)
+	     index <= last_index && !ret; index++) {
 		ret = relocate_one_page(inode, ra, cluster, &cluster_nr, index);
+	}
+	btrfs_zoned_relocation_io_unlock(BTRFS_I(inode));
+	balance_dirty_pages_ratelimited(inode->i_mapping);
+	btrfs_throttle(fs_info);
 	if (btrfs_is_zoned(fs_info) && !ret)
 		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
 	if (ret == 0)
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 4b299705bb12..70d2d65bf5cc 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -8,6 +8,7 @@
 #include "volumes.h"
 #include "disk-io.h"
 #include "block-group.h"
+#include "btrfs_inode.h"
 
 /*
  * Block groups with more than this value (percents) of unusable space will be
@@ -304,6 +305,32 @@ static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
 	mutex_unlock(&fs_info->zoned_meta_io_lock);
 }
 
+static inline void btrfs_zoned_relocation_io_lock(struct btrfs_inode *inode)
+{
+	struct btrfs_root *root = inode->root;
+
+	if (!btrfs_is_zoned(root->fs_info))
+		return;
+
+	if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID)
+		return;
+
+	mutex_lock(&inode->relocation_lock);
+}
+
+static inline void btrfs_zoned_relocation_io_unlock(struct btrfs_inode *inode)
+{
+	struct btrfs_root *root = inode->root;
+
+	if (!btrfs_is_zoned(root->fs_info))
+		return;
+
+	if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID)
+		return;
+
+	mutex_unlock(&inode->relocation_lock);
+}
+
 static inline void btrfs_clear_treelog_bg(struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
Relocation in a zoned filesystem can fail with a transaction abort with
error -22 (EINVAL). This happens because the relocation code assumes that
the extents we relocated the data to have the same size as the source
extents had, and ensures this by preallocating the extents.

But in a zoned filesystem we can't preallocate the extents as this would
break the sequential write required rule. Therefore it can happen that the
writeback process kicks in while we're still adding pages to a
delalloc range and starts writing out dirty pages.

This then creates destination extents that are smaller than the source
extents, triggering the following safety check in get_new_location():

1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
1035                 ret = -EINVAL;
1036                 goto out;
1037         }

One possible solution to address this problem is to mutually exclude page
writeback and adding pages to the relocation inode. This ensures that
we're not submitting an extent before all needed pages have been added to
it.

Introduce a new lock in the btrfs_inode which is only taken *IFF* the
inode is a data relocation inode on a zoned filesystem, to mutually exclude
relocation's construction of extents and page writeback.

Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem")
Reported-by: David Sterba <dsterba@suse.com>
Cc: Filipe Manana <fdmanana@suse.com>
Cc: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/btrfs_inode.h |  3 +++
 fs/btrfs/extent_io.c   |  4 ++++
 fs/btrfs/inode.c       |  1 +
 fs/btrfs/relocation.c  | 10 +++++++---
 fs/btrfs/zoned.h       | 27 +++++++++++++++++++++++++++
 5 files changed, 42 insertions(+), 3 deletions(-)