Message ID | 20191112151331.3641-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES | expand |
On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When using the NO_HOLES feature, if we punch a hole into a file and then > fsync it, there is a case where a subsequent fsync will miss the fact that > a hole was punched: > > 1) The extent items of the inode span multiple leafs; > > 2) The hole covers a range that affects only the extent items of the first > leaf; > > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC > is set in the inode's runtime flags). > > That results in the hole not existing after replaying the log tree. > > For example, if the fs/subvolume tree has the following layout for a > particular inode: > > Leaf N, generation 10: > > [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] > > Leaf N + 1, generation 10: > > [ EXTENT_ITEM (128K 64K) ... ] > > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end > up dropping the two extent items from leaf N, but we don't touch the other > leaf, so we end up in the following state: > > Leaf N, generation 11: > > [ ... INODE_ITEM INODE_REF ] > > Leaf N + 1, generation 10: > > [ EXTENT_ITEM (128K 64K) ... ] > > A full fsync after punching the hole will only process leaf N because it > was modified in the current transaction, but not leaf N + 1, since it was > not modified in the current transaction (generation 10 and not 11). As > a result the fsync will not log any holes, because it didn't process any > leaf with extent items. > > So fix this by detecting any leading hole in the file for a full fsync > when using the NO_HOLES feature if we didn't process any extent items for > the file. > > A test case for fstests follows soon. > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") > Signed-off-by: Filipe Manana <fdmanana@suse.com> This adds an extra search for every FULL_SYNC, can we just catch this case in the main loop, say we keep track of the last extent we found, and then when we end up with ret > 1 || a min_key that's past the end of the last extent we saw we know we had a hole punch? Thanks, Josef
On Tue, Nov 12, 2019 at 5:35 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > When using the NO_HOLES feature, if we punch a hole into a file and then > > fsync it, there is a case where a subsequent fsync will miss the fact that > > a hole was punched: > > > > 1) The extent items of the inode span multiple leafs; > > > > 2) The hole covers a range that affects only the extent items of the first > > leaf; > > > > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC > > is set in the inode's runtime flags). > > > > That results in the hole not existing after replaying the log tree. > > > > For example, if the fs/subvolume tree has the following layout for a > > particular inode: > > > > Leaf N, generation 10: > > > > [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] > > > > Leaf N + 1, generation 10: > > > > [ EXTENT_ITEM (128K 64K) ... ] > > > > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end > > up dropping the two extent items from leaf N, but we don't touch the other > > leaf, so we end up in the following state: > > > > Leaf N, generation 11: > > > > [ ... INODE_ITEM INODE_REF ] > > > > Leaf N + 1, generation 10: > > > > [ EXTENT_ITEM (128K 64K) ... ] > > > > A full fsync after punching the hole will only process leaf N because it > > was modified in the current transaction, but not leaf N + 1, since it was > > not modified in the current transaction (generation 10 and not 11). As > > a result the fsync will not log any holes, because it didn't process any > > leaf with extent items. > > > > So fix this by detecting any leading hole in the file for a full fsync > > when using the NO_HOLES feature if we didn't process any extent items for > > the file. > > > > A test case for fstests follows soon. > > > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > This adds an extra search for every FULL_SYNC, can we just catch this case in > the main loop, say we keep track of the last extent we found, It's already doing that by checking if "last_extent == 0" before calling the new function. Having last_extent == 0, no extents processed is very rare (hitting that specific item layout and hole range). > and then when we > end up with ret > 1 || a min_key that's past the end of the last extent we saw > we know we had a hole punch? Thanks, > > Josef
On Tue, Nov 12, 2019 at 05:39:59PM +0000, Filipe Manana wrote: > On Tue, Nov 12, 2019 at 5:35 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote: > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > When using the NO_HOLES feature, if we punch a hole into a file and then > > > fsync it, there is a case where a subsequent fsync will miss the fact that > > > a hole was punched: > > > > > > 1) The extent items of the inode span multiple leafs; > > > > > > 2) The hole covers a range that affects only the extent items of the first > > > leaf; > > > > > > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC > > > is set in the inode's runtime flags). > > > > > > That results in the hole not existing after replaying the log tree. > > > > > > For example, if the fs/subvolume tree has the following layout for a > > > particular inode: > > > > > > Leaf N, generation 10: > > > > > > [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] > > > > > > Leaf N + 1, generation 10: > > > > > > [ EXTENT_ITEM (128K 64K) ... ] > > > > > > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end > > > up dropping the two extent items from leaf N, but we don't touch the other > > > leaf, so we end up in the following state: > > > > > > Leaf N, generation 11: > > > > > > [ ... INODE_ITEM INODE_REF ] > > > > > > Leaf N + 1, generation 10: > > > > > > [ EXTENT_ITEM (128K 64K) ... ] > > > > > > A full fsync after punching the hole will only process leaf N because it > > > was modified in the current transaction, but not leaf N + 1, since it was > > > not modified in the current transaction (generation 10 and not 11). As > > > a result the fsync will not log any holes, because it didn't process any > > > leaf with extent items. > > > > > > So fix this by detecting any leading hole in the file for a full fsync > > > when using the NO_HOLES feature if we didn't process any extent items for > > > the file. > > > > > > A test case for fstests follows soon. > > > > > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > This adds an extra search for every FULL_SYNC, can we just catch this case in > > the main loop, say we keep track of the last extent we found, > > It's already doing that by checking if "last_extent == 0" before > calling the new function. > Having last_extent == 0, no extents processed is very rare (hitting > that specific item layout and hole range). > Yup you're right, I missed that bit. You can add Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When using the NO_HOLES feature, if we punch a hole into a file and then > fsync it, there is a case where a subsequent fsync will miss the fact that > a hole was punched: > > 1) The extent items of the inode span multiple leafs; > > 2) The hole covers a range that affects only the extent items of the first > leaf; > > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC > is set in the inode's runtime flags). > > That results in the hole not existing after replaying the log tree. > > For example, if the fs/subvolume tree has the following layout for a > particular inode: > > Leaf N, generation 10: > > [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] > > Leaf N + 1, generation 10: > > [ EXTENT_ITEM (128K 64K) ... ] > > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end > up dropping the two extent items from leaf N, but we don't touch the other > leaf, so we end up in the following state: > > Leaf N, generation 11: > > [ ... INODE_ITEM INODE_REF ] > > Leaf N + 1, generation 10: > > [ EXTENT_ITEM (128K 64K) ... ] > > A full fsync after punching the hole will only process leaf N because it > was modified in the current transaction, but not leaf N + 1, since it was > not modified in the current transaction (generation 10 and not 11). As > a result the fsync will not log any holes, because it didn't process any > leaf with extent items. > > So fix this by detecting any leading hole in the file for a full fsync > when using the NO_HOLES feature if we didn't process any extent items for > the file. > > A test case for fstests follows soon. > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to misc-next, thanks.
On Thu, Nov 14, 2019 at 3:18 PM David Sterba <dsterba@suse.cz> wrote: > > On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > When using the NO_HOLES feature, if we punch a hole into a file and then > > fsync it, there is a case where a subsequent fsync will miss the fact that > > a hole was punched: > > > > 1) The extent items of the inode span multiple leafs; > > > > 2) The hole covers a range that affects only the extent items of the first > > leaf; > > > > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC > > is set in the inode's runtime flags). > > > > That results in the hole not existing after replaying the log tree. > > > > For example, if the fs/subvolume tree has the following layout for a > > particular inode: > > > > Leaf N, generation 10: > > > > [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] > > > > Leaf N + 1, generation 10: > > > > [ EXTENT_ITEM (128K 64K) ... ] > > > > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end > > up dropping the two extent items from leaf N, but we don't touch the other > > leaf, so we end up in the following state: > > > > Leaf N, generation 11: > > > > [ ... INODE_ITEM INODE_REF ] > > > > Leaf N + 1, generation 10: > > > > [ EXTENT_ITEM (128K 64K) ... ] > > > > A full fsync after punching the hole will only process leaf N because it > > was modified in the current transaction, but not leaf N + 1, since it was > > not modified in the current transaction (generation 10 and not 11). As > > a result the fsync will not log any holes, because it didn't process any > > leaf with extent items. > > > > So fix this by detecting any leading hole in the file for a full fsync > > when using the NO_HOLES feature if we didn't process any extent items for > > the file. > > > > A test case for fstests follows soon. > > > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > Added to misc-next, thanks. Actually, could you hold on a bit for this one? There's nothing wrong with it, I'm simply fixing other cases and realized I can fix them all in a single patch. If I take too long or end up not being able to fix all as I'm expecting, I'll let you know, otherwise I'll send a very different v2 tomorrow. Thanks.
On Thu, Nov 14, 2019 at 03:45:38PM +0000, Filipe Manana wrote: > > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > Added to misc-next, thanks. > > Actually, could you hold on a bit for this one? > There's nothing wrong with it, I'm simply fixing other cases and > realized I can fix them all in a single patch. > If I take too long or end up not being able to fix all as I'm > expecting, I'll let you know, otherwise I'll send a very different v2 > tomorrow. Sure, no problem. Patch removed from misc-next. Thanks.
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 8a6cc600bf18..dfc07d23c903 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4807,6 +4807,68 @@ static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans, } /* + * When using the NO_HOLES feature we need special care to make sure we don't + * miss a hole that starts at file offset 0. If we have an inode with extent + * items spanning multiple leafs and we punch a hole covering only the extents + * in the first leaf, then the hole punching operation deletes those extent + * items from the leaf without touching the remaining leafs (unless the leaf + * becomes too small in which case we may move items from the first leaf to the + * second leaf) - this causes a full fsync to not copy any extent items and + * therefore not detect and log a hole starting at offset 0. We check for such + * scenario here and log the hole. + */ +static int btrfs_log_leading_hole(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_inode *inode, + struct btrfs_path *path) +{ + struct btrfs_key key; + const u64 ino = btrfs_ino(inode); + u64 hole_len; + int ret; + + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES)) + return 0; + + if (i_size_read(&inode->vfs_inode) == 0) + return 0; + + key.objectid = ino; + key.type = BTRFS_EXTENT_DATA_KEY; + key.offset = 0; + + /* + * If there's an extent item at file offset 0 we don't have anything + * to do, just return. + */ + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret <= 0) + return ret; + + /* + * Figure out the size of the hole. If there is no next extent, just + * exit without doing anything (if the file consists of a single hole, + * has no extents, we log the hole at btrfs_log_trailing_hole()). + */ + if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) { + ret = btrfs_next_leaf(root, path); + if (ret < 0) + return ret; + else if (ret > 0) + return 0; + } + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) + return 0; + + hole_len = key.offset; + btrfs_release_path(path); + ret = btrfs_insert_file_extent(trans, root->log_root, ino, 0, 0, 0, + hole_len, 0, hole_len, 0, 0, 0); + return ret; +} + +/* * When we are logging a new inode X, check if it doesn't have a reference that * matches the reference from some other inode Y created in a past transaction * and that was renamed in the current transaction. If we don't do this, then at @@ -5393,6 +5455,18 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, if (err) goto out_unlock; xattrs_logged = true; + /* + * If we are doing a full fsync and didn't copy any extent items, we + * may need to log a leading hole extent item when using the NO_HOLES + * feature, otherwise we end up not persisting the hole. + */ + if (!fast_search && last_extent == 0) { + btrfs_release_path(path); + btrfs_release_path(dst_path); + err = btrfs_log_leading_hole(trans, root, inode, path); + if (err) + goto out_unlock; + } if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) { btrfs_release_path(path); btrfs_release_path(dst_path);