btrfs: detect nocow for swap after snapshot delete

Message ID 20200818180005.933061-1-boris@bur.io (mailing list archive)
State New, archived
Series btrfs: detect nocow for swap after snapshot delete

Commit Message

Boris Burkov Aug. 18, 2020, 6 p.m. UTC
can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
detecting a must cow condition which is not exactly accurate, but saves
unnecessary tree traversal. The incorrect assumption is that if the
extent was created in a generation smaller than the last snapshot
generation, it must be referenced by that snapshot. That is true, except
the snapshot could have since been deleted, without affecting the last
snapshot generation.

The original patch claimed a performance win from this check, but it
also leads to a bug where you are unable to use a swapfile if you ever
snapshotted the subvolume it's in. Make the check slower and more strict
for the swapon case, without modifying the general cow checks as a
compromise. Turning swap on does not seem to be a particularly
performance sensitive operation, so incurring a possibly unnecessary
btrfs_search_slot seems worthwhile for the added usability.

Note: Until the snapshot delete transaction is committed,
check_committed_refs will still cause the logic to think that cow is
necessary, so the user must sync before swapon.

Signed-off-by: Boris Burkov <boris@bur.io>
Suggested-by: Omar Sandoval <osandov@osandov.com>
---
 fs/btrfs/ctree.h       |  4 ++--
 fs/btrfs/extent-tree.c | 17 +++++++++++------
 fs/btrfs/file.c        |  2 +-
 fs/btrfs/inode.c       | 18 +++++++++++-------
 4 files changed, 25 insertions(+), 16 deletions(-)

Comments

David Sterba Aug. 19, 2020, 11:29 a.m. UTC | #1
On Tue, Aug 18, 2020 at 11:00:05AM -0700, Boris Burkov wrote:
> can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
> detecting a must cow condition which is not exactly accurate, but saves
> unnecessary tree traversal. The incorrect assumption is that if the
> extent was created in a generation smaller than the last snapshot
> generation, it must be referenced by that snapshot. That is true, except
> the snapshot could have since been deleted, without affecting the last
> snapshot generation.
> 
> The original patch claimed a performance win from this check, but it
> also leads to a bug where you are unable to use a swapfile if you ever
> snapshotted the subvolume it's in.

Do you mean snapshotted and deleted? As I understand it:

- create fs
- create swapfile in a subvolume
- activate, use, deactivate it
- create snapshot, maybe changing the swapfile
- delete the snapshot, wait until it's gone
- swapfile activation fails, although there are no shared extents and
  the state is effectively the same as if it were created from scratch

> Make the check slower and more strict
> for the swapon case, without modifying the general cow checks as a
> compromise. Turning swap on does not seem to be a particularly
> performance sensitive operation, so incurring a possibly unnecessary
> btrfs_search_slot seems worthwhile for the added usability.

Yeah, slowdown should be acceptable here, it's a one-time action going
over some metadata.

> Note: Until the snapshot delete transaction is committed,
> check_committed_refs will still cause the logic to think that cow is
> necessary, so the user must sync before swapon.

How often could the snapshot deletion and swapfile activation happen at
the same time? Snapshotting a subvolume with the swapfile requires
deactivation, snapshot/send/whatever and then activation. This does not
sound like a realistic use case.
Graham Cobb Aug. 19, 2020, 11:59 a.m. UTC | #2
On 19/08/2020 12:29, David Sterba wrote:
> How often could the snapshot deletion and swapfile activation happen at
> the same time? Snapshotting a subvolume with the swapfile requires
> deactivation, snapshot/send/whatever and then activation. This does not
> sound like a realistic use case.

It is very likely when the swapfile is one that is only used
occasionally (for example, when running a particular program which needs
a massive amount of virtual memory, or having to stop using a different
swapfile because a disk looks like it is starting to fail).

If the swapfile is not normally used, it is not unlikely it got
snapshotted (as part of a larger operation, presumably) while
deactivated. When the user tries to use it, they realise it isn't
working because it is snapshotted, so they delete the snapshot and then
immediately try to activate it again -- causing confusion when it still
fails.
David Sterba Aug. 19, 2020, 2:25 p.m. UTC | #3
On Wed, Aug 19, 2020 at 12:59:28PM +0100, Graham Cobb wrote:
> On 19/08/2020 12:29, David Sterba wrote:
> > How often could the snapshot deletion and swapfile activation happen at
> > the same time? Snapshotting a subvolume with the swapfile requires
> > deactivation, snapshot/send/whatever and then activation. This does not
> > sound like a realistic use case.
> 
> It is very likely when the swapfile is one that is only used
> occasionally (for example, when running a particular program which needs
> a massive amount of virtual memory, or having to stop using a different
> swapfile because a disk looks like it is starting to fail).
> 
> If the swapfile is not normally used, it is not unlikely it got
> snapshotted (as part of a larger operation, presumably) while
> deactivated. When the user tries to use it, they realise it isn't
> working because it is snapshotted, so they delete the snapshot and then
> immediately try to activate it again -- causing confusion when it still
> fails.

That makes sense from user POV. I still don't understand if it's
sufficient to commit the transaction deleting the snapshot or if it's
necessary to wait until the subvolume is completely cleaned.

The former would require 'btrfs subvol delete -c /snapshot' while the
latter needs the id of the subvolume and then
'btrfs subvol sync /path id'.
Boris Burkov Aug. 19, 2020, 5:46 p.m. UTC | #4
On Wed, Aug 19, 2020 at 04:25:48PM +0200, David Sterba wrote:
> On Wed, Aug 19, 2020 at 12:59:28PM +0100, Graham Cobb wrote:
> > On 19/08/2020 12:29, David Sterba wrote:
> > > How often could the snapshot deletion and swapfile activation happen at
> > > the same time? Snapshotting a subvolume with the swapfile requires
> > > deactivation, snapshot/send/whatever and then activation. This does not
> > > sound like a realistic use case.
> > 
> > It is very likely when the swapfile is one that is only used
> > occasionally (for example, when running a particular program which needs
> > a massive amount of virtual memory, or having to stop using a different
> > swapfile because a disk looks like it is starting to fail).
> > 
> > If the swapfile is not normally used, it is not unlikely it got
> > snapshotted (as part of a larger operation, presumably) while
> > deactivated. When the user tries to use it, they realise it isn't
> > working because it is snapshotted, so they delete the snapshot and then
> > immediately try to activate it again -- causing confusion when it still
> > fails.
> 
> That makes sense from user POV. I still don't understand if it's
> sufficient to commit the transaction deleting the snapshot or if it's
> necessary to wait until the subvolume is completely cleaned.
> 
> The former would require 'btrfs subvol delete -c /snapshot' while the
> latter needs the id of the subvolume and then
> 'btrfs subvol sync /path id'.

My reproduction has been:
create subvol
create swapfile in subvol
btrfs subvol snapshot subvol snapshot
btrfs subvol delete snapshot
sync/btrfs fi sync/waiting
swapon subvol/swapfile

Note that I haven't been touching the swapfile in any way after
initially creating it. Nor have I been turning swap on/off in any
coordinated way before/after the snapshot except to test after the
snapshot is deleted.

I tried both suggestions with and without the patch and saw that with
the patch, swapon reliably succeeds after 'btrfs subvol sync' and does
not reliably succeed even after 'btrfs subvol delete -c' or sync calls,
so it seems that we need the subvolume to be completely cleaned. Without
the patch, after both 'btrfs subvol delete -c' and after 'btrfs subvol
sync', the swapon fails. Thanks for the review and testing suggestions!
David Sterba Aug. 20, 2020, 4:13 p.m. UTC | #5
On Wed, Aug 19, 2020 at 10:46:09AM -0700, Boris Burkov wrote:
> My reproduction has been:
> create subvol
> create swapfile in subvol
> btrfs subvol snapshot subvol snapshot
> btrfs subvol delete snapshot
> sync/btrfs fi sync/waiting
> swapon subvol/swapfile
> 
> Note that I haven't been touching the swapfile in any way after
> initially creating it. Nor have I been turning swap on/off in any
> coordinated way before/after the snapshot except to test after the
> snapshot is deleted.

Ok, so this is the minimal testcase.

> I tried both suggestions with and without the patch and saw that with
> the patch, swapon reliably succeeds after 'btrfs subvol sync' and does
> not reliably succeed even after 'btrfs subvol delete -c' or sync calls,
> so it seems that we need the subvolume to be completely cleaned. Without
> the patch, after both 'btrfs subvol delete -c' and after 'btrfs subvol
> sync', the swapon fails.

The patch is fine as is so I'll add it to misc-next and tag for stable.
The remaining part is to document in the manual page what to do in case
the snapshot was accidentally created and needs to be deleted.

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9c7e466f27a9..1d3162cc8fc9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2518,7 +2518,7 @@  int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 				    u64 bytenr, u64 num_bytes);
 int btrfs_exclude_logged_extents(struct extent_buffer *eb);
 int btrfs_cross_ref_exist(struct btrfs_root *root,
-			  u64 objectid, u64 offset, u64 bytenr);
+			  u64 objectid, u64 offset, u64 bytenr, bool strict);
 struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     struct btrfs_root *root,
 					     u64 parent, u64 root_objectid,
@@ -2934,7 +2934,7 @@  struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
 					   u64 start, u64 len);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      u64 *orig_start, u64 *orig_block_len,
-			      u64 *ram_bytes);
+			      u64 *ram_bytes, bool strict);
 
 void __btrfs_del_delalloc_inode(struct btrfs_root *root,
 				struct btrfs_inode *inode);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 597505df90b4..096d20fab32f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2306,7 +2306,8 @@  static noinline int check_delayed_ref(struct btrfs_root *root,
 
 static noinline int check_committed_ref(struct btrfs_root *root,
 					struct btrfs_path *path,
-					u64 objectid, u64 offset, u64 bytenr)
+					u64 objectid, u64 offset, u64 bytenr,
+					bool strict)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *extent_root = fs_info->extent_root;
@@ -2348,9 +2349,13 @@  static noinline int check_committed_ref(struct btrfs_root *root,
 	    btrfs_extent_inline_ref_size(BTRFS_EXTENT_DATA_REF_KEY))
 		goto out;
 
-	/* If extent created before last snapshot => it's definitely shared */
-	if (btrfs_extent_generation(leaf, ei) <=
-	    btrfs_root_last_snapshot(&root->root_item))
+	/*
+	 * If extent created before last snapshot => it's shared unless the
+	 * snapshot has been deleted. Use the heuristic if strict is false.
+	 */
+	if (!strict &&
+	    (btrfs_extent_generation(leaf, ei) <=
+	     btrfs_root_last_snapshot(&root->root_item)))
 		goto out;
 
 	iref = (struct btrfs_extent_inline_ref *)(ei + 1);
@@ -2375,7 +2380,7 @@  static noinline int check_committed_ref(struct btrfs_root *root,
 }
 
 int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
-			  u64 bytenr)
+			  u64 bytenr, bool strict)
 {
 	struct btrfs_path *path;
 	int ret;
@@ -2386,7 +2391,7 @@  int btrfs_cross_ref_exist(struct btrfs_root *root, u64 objectid, u64 offset,
 
 	do {
 		ret = check_committed_ref(root, path, objectid,
-					  offset, bytenr);
+					  offset, bytenr, strict);
 		if (ret && ret != -ENOENT)
 			goto out;
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index bb824c7cb7c7..4507c3d09399 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1571,7 +1571,7 @@  static int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
 	}
 
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
-			NULL, NULL, NULL);
+			NULL, NULL, NULL, false);
 	if (ret <= 0) {
 		ret = 0;
 		if (!nowait)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8c49168a1b48..df08ace5d914 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1610,7 +1610,7 @@  static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 				goto out_check;
 			ret = btrfs_cross_ref_exist(root, ino,
 						    found_key.offset -
-						    extent_offset, disk_bytenr);
+						    extent_offset, disk_bytenr, false);
 			if (ret) {
 				/*
 				 * ret could be -EIO if the above fails to read
@@ -6949,6 +6949,8 @@  static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
  * @orig_start:	(optional) Return the original file offset of the file extent
  * @orig_len:	(optional) Return the original on-disk length of the file extent
  * @ram_bytes:	(optional) Return the ram_bytes of the file extent
+ * @strict:	if true, omit optimizations that might force us into unnecessary
+ *		cow. e.g., don't trust generation number.
  *
  * This function will flush ordered extents in the range to ensure proper
  * nocow checks for (nowait == false) case.
@@ -6963,7 +6965,7 @@  static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
  */
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      u64 *orig_start, u64 *orig_block_len,
-			      u64 *ram_bytes)
+			      u64 *ram_bytes, bool strict)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_path *path;
@@ -7041,8 +7043,9 @@  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 	 * Do the same check as in btrfs_cross_ref_exist but without the
 	 * unnecessary search.
 	 */
-	if (btrfs_file_extent_generation(leaf, fi) <=
-	    btrfs_root_last_snapshot(&root->root_item))
+	if (!strict &&
+	    (btrfs_file_extent_generation(leaf, fi) <=
+	     btrfs_root_last_snapshot(&root->root_item)))
 		goto out;
 
 	backref_offset = btrfs_file_extent_offset(leaf, fi);
@@ -7078,7 +7081,8 @@  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 	 */
 
 	ret = btrfs_cross_ref_exist(root, btrfs_ino(BTRFS_I(inode)),
-				    key.offset - backref_offset, disk_bytenr);
+				    key.offset - backref_offset, disk_bytenr,
+				    strict);
 	if (ret) {
 		ret = 0;
 		goto out;
@@ -7299,7 +7303,7 @@  static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		block_start = em->block_start + (start - em->start);
 
 		if (can_nocow_extent(inode, start, &len, &orig_start,
-				     &orig_block_len, &ram_bytes) == 1 &&
+				     &orig_block_len, &ram_bytes, false) == 1 &&
 		    btrfs_inc_nocow_writers(fs_info, block_start)) {
 			struct extent_map *em2;
 
@@ -10130,7 +10134,7 @@  static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		free_extent_map(em);
 		em = NULL;
 
-		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
+		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, true);
 		if (ret < 0) {
 			goto out;
 		} else if (ret) {