diff mbox series

Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES

Message ID 20191112151331.3641-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show
Series Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES | expand

Commit Message

Filipe Manana Nov. 12, 2019, 3:13 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When using the NO_HOLES feature, if we punch a hole into a file and then
fsync it, there is a case where a subsequent fsync will miss the fact that
a hole was punched:

1) The extent items of the inode span multiple leafs;

2) The hole covers a range that affects only the extent items of the first
   leaf;

3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
   is set in the inode's runtime flags).

That results in the hole not existing after replaying the log tree.

For example, if the fs/subvolume tree has the following layout for a
particular inode:

  Leaf N, generation 10:

  [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]

  Leaf N + 1, generation 10:

  [ EXTENT_ITEM (128K 64K) ... ]

If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
up dropping the two extent items from leaf N, but we don't touch the other
leaf, so we end up in the following state:

  Leaf N, generation 11:

  [ ... INODE_ITEM INODE_REF ]

  Leaf N + 1, generation 10:

  [ EXTENT_ITEM (128K 64K) ... ]

A full fsync after punching the hole will only process leaf N because it
was modified in the current transaction, but not leaf N + 1, since it was
not modified in the current transaction (generation 10 and not 11). As
a result the fsync will not log any holes, because it didn't process any
leaf with extent items.

So fix this by detecting any leading hole in the file for a full fsync
when using the NO_HOLES feature if we didn't process any extent items for
the file.

A test case for fstests follows soon.

Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

Comments

Josef Bacik Nov. 12, 2019, 5:34 p.m. UTC | #1
On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When using the NO_HOLES feature, if we punch a hole into a file and then
> fsync it, there is a case where a subsequent fsync will miss the fact that
> a hole was punched:
> 
> 1) The extent items of the inode span multiple leafs;
> 
> 2) The hole covers a range that affects only the extent items of the first
>    leaf;
> 
> 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
>    is set in the inode's runtime flags).
> 
> That results in the hole not existing after replaying the log tree.
> 
> For example, if the fs/subvolume tree has the following layout for a
> particular inode:
> 
>   Leaf N, generation 10:
> 
>   [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
> 
>   Leaf N + 1, generation 10:
> 
>   [ EXTENT_ITEM (128K 64K) ... ]
> 
> If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
> up dropping the two extent items from leaf N, but we don't touch the other
> leaf, so we end up in the following state:
> 
>   Leaf N, generation 11:
> 
>   [ ... INODE_ITEM INODE_REF ]
> 
>   Leaf N + 1, generation 10:
> 
>   [ EXTENT_ITEM (128K 64K) ... ]
> 
> A full fsync after punching the hole will only process leaf N because it
> was modified in the current transaction, but not leaf N + 1, since it was
> not modified in the current transaction (generation 10 and not 11). As
> a result the fsync will not log any holes, because it didn't process any
> leaf with extent items.
> 
> So fix this by detecting any leading hole in the file for a full fsync
> when using the NO_HOLES feature if we didn't process any extent items for
> the file.
> 
> A test case for fstests follows soon.
> 
> Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

This adds an extra search for every FULL_SYNC, can we just catch this case in
the main loop, say we keep track of the last extent we found, and then when we
end up with ret > 1 || a min_key that's past the end of the last extent we saw
we know we had a hole punch?  Thanks,

Josef
Filipe Manana Nov. 12, 2019, 5:39 p.m. UTC | #2
On Tue, Nov 12, 2019 at 5:35 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When using the NO_HOLES feature, if we punch a hole into a file and then
> > fsync it, there is a case where a subsequent fsync will miss the fact that
> > a hole was punched:
> >
> > 1) The extent items of the inode span multiple leafs;
> >
> > 2) The hole covers a range that affects only the extent items of the first
> >    leaf;
> >
> > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
> >    is set in the inode's runtime flags).
> >
> > That results in the hole not existing after replaying the log tree.
> >
> > For example, if the fs/subvolume tree has the following layout for a
> > particular inode:
> >
> >   Leaf N, generation 10:
> >
> >   [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
> >
> >   Leaf N + 1, generation 10:
> >
> >   [ EXTENT_ITEM (128K 64K) ... ]
> >
> > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
> > up dropping the two extent items from leaf N, but we don't touch the other
> > leaf, so we end up in the following state:
> >
> >   Leaf N, generation 11:
> >
> >   [ ... INODE_ITEM INODE_REF ]
> >
> >   Leaf N + 1, generation 10:
> >
> >   [ EXTENT_ITEM (128K 64K) ... ]
> >
> > A full fsync after punching the hole will only process leaf N because it
> > was modified in the current transaction, but not leaf N + 1, since it was
> > not modified in the current transaction (generation 10 and not 11). As
> > a result the fsync will not log any holes, because it didn't process any
> > leaf with extent items.
> >
> > So fix this by detecting any leading hole in the file for a full fsync
> > when using the NO_HOLES feature if we didn't process any extent items for
> > the file.
> >
> > A test case for fstests follows soon.
> >
> > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> This adds an extra search for every FULL_SYNC, can we just catch this case in
> the main loop, say we keep track of the last extent we found,

It's already doing that by checking if "last_extent == 0" before
calling the new function.
Having last_extent == 0, no extents processed is very rare (hitting
that specific item layout and hole range).

> and then when we
> end up with ret > 1 || a min_key that's past the end of the last extent we saw
> we know we had a hole punch?  Thanks,
>
> Josef
Josef Bacik Nov. 12, 2019, 5:43 p.m. UTC | #3
On Tue, Nov 12, 2019 at 05:39:59PM +0000, Filipe Manana wrote:
> On Tue, Nov 12, 2019 at 5:35 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > When using the NO_HOLES feature, if we punch a hole into a file and then
> > > fsync it, there is a case where a subsequent fsync will miss the fact that
> > > a hole was punched:
> > >
> > > 1) The extent items of the inode span multiple leafs;
> > >
> > > 2) The hole covers a range that affects only the extent items of the first
> > >    leaf;
> > >
> > > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
> > >    is set in the inode's runtime flags).
> > >
> > > That results in the hole not existing after replaying the log tree.
> > >
> > > For example, if the fs/subvolume tree has the following layout for a
> > > particular inode:
> > >
> > >   Leaf N, generation 10:
> > >
> > >   [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
> > >
> > >   Leaf N + 1, generation 10:
> > >
> > >   [ EXTENT_ITEM (128K 64K) ... ]
> > >
> > > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
> > > up dropping the two extent items from leaf N, but we don't touch the other
> > > leaf, so we end up in the following state:
> > >
> > >   Leaf N, generation 11:
> > >
> > >   [ ... INODE_ITEM INODE_REF ]
> > >
> > >   Leaf N + 1, generation 10:
> > >
> > >   [ EXTENT_ITEM (128K 64K) ... ]
> > >
> > > A full fsync after punching the hole will only process leaf N because it
> > > was modified in the current transaction, but not leaf N + 1, since it was
> > > not modified in the current transaction (generation 10 and not 11). As
> > > a result the fsync will not log any holes, because it didn't process any
> > > leaf with extent items.
> > >
> > > So fix this by detecting any leading hole in the file for a full fsync
> > > when using the NO_HOLES feature if we didn't process any extent items for
> > > the file.
> > >
> > > A test case for fstests follows soon.
> > >
> > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >
> > This adds an extra search for every FULL_SYNC, can we just catch this case in
> > the main loop, say we keep track of the last extent we found,
> 
> It's already doing that by checking if "last_extent == 0" before
> calling the new function.
> Having last_extent == 0, no extents processed is very rare (hitting
> that specific item layout and hole range).
> 

Yup you're right, I missed that bit.  You can add 

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
David Sterba Nov. 14, 2019, 3:18 p.m. UTC | #4
On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When using the NO_HOLES feature, if we punch a hole into a file and then
> fsync it, there is a case where a subsequent fsync will miss the fact that
> a hole was punched:
> 
> 1) The extent items of the inode span multiple leafs;
> 
> 2) The hole covers a range that affects only the extent items of the first
>    leaf;
> 
> 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
>    is set in the inode's runtime flags).
> 
> That results in the hole not existing after replaying the log tree.
> 
> For example, if the fs/subvolume tree has the following layout for a
> particular inode:
> 
>   Leaf N, generation 10:
> 
>   [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
> 
>   Leaf N + 1, generation 10:
> 
>   [ EXTENT_ITEM (128K 64K) ... ]
> 
> If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
> up dropping the two extent items from leaf N, but we don't touch the other
> leaf, so we end up in the following state:
> 
>   Leaf N, generation 11:
> 
>   [ ... INODE_ITEM INODE_REF ]
> 
>   Leaf N + 1, generation 10:
> 
>   [ EXTENT_ITEM (128K 64K) ... ]
> 
> A full fsync after punching the hole will only process leaf N because it
> was modified in the current transaction, but not leaf N + 1, since it was
> not modified in the current transaction (generation 10 and not 11). As
> a result the fsync will not log any holes, because it didn't process any
> leaf with extent items.
> 
> So fix this by detecting any leading hole in the file for a full fsync
> when using the NO_HOLES feature if we didn't process any extent items for
> the file.
> 
> A test case for fstests follows soon.
> 
> Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks.
Filipe Manana Nov. 14, 2019, 3:45 p.m. UTC | #5
On Thu, Nov 14, 2019 at 3:18 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Tue, Nov 12, 2019 at 03:13:31PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When using the NO_HOLES feature, if we punch a hole into a file and then
> > fsync it, there is a case where a subsequent fsync will miss the fact that
> > a hole was punched:
> >
> > 1) The extent items of the inode span multiple leafs;
> >
> > 2) The hole covers a range that affects only the extent items of the first
> >    leaf;
> >
> > 3) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
> >    is set in the inode's runtime flags).
> >
> > That results in the hole not existing after replaying the log tree.
> >
> > For example, if the fs/subvolume tree has the following layout for a
> > particular inode:
> >
> >   Leaf N, generation 10:
> >
> >   [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
> >
> >   Leaf N + 1, generation 10:
> >
> >   [ EXTENT_ITEM (128K 64K) ... ]
> >
> > If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
> > up dropping the two extent items from leaf N, but we don't touch the other
> > leaf, so we end up in the following state:
> >
> >   Leaf N, generation 11:
> >
> >   [ ... INODE_ITEM INODE_REF ]
> >
> >   Leaf N + 1, generation 10:
> >
> >   [ EXTENT_ITEM (128K 64K) ... ]
> >
> > A full fsync after punching the hole will only process leaf N because it
> > was modified in the current transaction, but not leaf N + 1, since it was
> > not modified in the current transaction (generation 10 and not 11). As
> > a result the fsync will not log any holes, because it didn't process any
> > leaf with extent items.
> >
> > So fix this by detecting any leading hole in the file for a full fsync
> > when using the NO_HOLES feature if we didn't process any extent items for
> > the file.
> >
> > A test case for fstests follows soon.
> >
> > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Added to misc-next, thanks.

Actually, could you hold on a bit for this one?
There's nothing wrong with it, I'm simply fixing other cases and
realized I can fix them all in a single patch.
If I take too long or end up not being able to fix all as I'm
expecting, I'll let you know, otherwise I'll send a very different v2
tomorrow.

Thanks.
David Sterba Nov. 15, 2019, 9:42 a.m. UTC | #6
On Thu, Nov 14, 2019 at 03:45:38PM +0000, Filipe Manana wrote:
> > > Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents")
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >
> > Added to misc-next, thanks.
> 
> Actually, could you hold on a bit for this one?
> There's nothing wrong with it, I'm simply fixing other cases and
> realized I can fix them all in a single patch.
> If I take too long or end up not being able to fix all as I'm
> expecting, I'll let you know, otherwise I'll send a very different v2
> tomorrow.

Sure, no problem. Patch removed from misc-next. Thanks.
diff mbox series

Patch

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8a6cc600bf18..dfc07d23c903 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4807,6 +4807,68 @@  static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans,
 }
 
 /*
+ * When using the NO_HOLES feature we need special care to make sure we don't
+ * miss a hole that starts at file offset 0. If we have an inode with extent
+ * items spanning multiple leafs and we punch a hole covering only the extents
+ * in the first leaf, then the hole punching operation deletes those extent
+ * items from the leaf without touching the remaining leafs (unless the leaf
+ * becomes too small in which case we may move items from the first leaf to the
+ * second leaf) - this causes a full fsync to not copy any extent items and
+ * therefore not detect and log a hole starting at offset 0. We check for such
+ * scenario here and log the hole.
+ */
+static int btrfs_log_leading_hole(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root,
+				  struct btrfs_inode *inode,
+				  struct btrfs_path *path)
+{
+	struct btrfs_key key;
+	const u64 ino = btrfs_ino(inode);
+	u64 hole_len;
+	int ret;
+
+	if (!btrfs_fs_incompat(root->fs_info, NO_HOLES))
+		return 0;
+
+	if (i_size_read(&inode->vfs_inode) == 0)
+		return 0;
+
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = 0;
+
+	/*
+	 * If there's an extent item at file offset 0 we don't have anything
+	 * to do, just return.
+	 */
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret <= 0)
+		return ret;
+
+	/*
+	 * Figure out the size of the hole. If there is no next extent, just
+	 * exit without doing anything (if the file consists of a single hole,
+	 * has no extents, we log the hole at btrfs_log_trailing_hole()).
+	 */
+	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+		ret = btrfs_next_leaf(root, path);
+		if (ret < 0)
+			return ret;
+		else if (ret > 0)
+			return 0;
+	}
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
+		return 0;
+
+	hole_len = key.offset;
+	btrfs_release_path(path);
+	ret = btrfs_insert_file_extent(trans, root->log_root, ino, 0, 0, 0,
+				       hole_len, 0, hole_len, 0, 0, 0);
+	return ret;
+}
+
+/*
  * When we are logging a new inode X, check if it doesn't have a reference that
  * matches the reference from some other inode Y created in a past transaction
  * and that was renamed in the current transaction. If we don't do this, then at
@@ -5393,6 +5455,18 @@  static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	if (err)
 		goto out_unlock;
 	xattrs_logged = true;
+	/*
+	 * If we are doing a full fsync and didn't copy any extent items, we
+	 * may need to log a leading hole extent item when using the NO_HOLES
+	 * feature, otherwise we end up not persisting the hole.
+	 */
+	if (!fast_search && last_extent == 0) {
+		btrfs_release_path(path);
+		btrfs_release_path(dst_path);
+		err = btrfs_log_leading_hole(trans, root, inode, path);
+		if (err)
+			goto out_unlock;
+	}
 	if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
 		btrfs_release_path(path);
 		btrfs_release_path(dst_path);