Btrfs: fix loss of prealloc extents past i_size after fsync log replay
diff mbox

Message ID 20180405215512.513-1-fdmanana@kernel.org
State New
Headers show

Commit Message

Filipe Manana April 5, 2018, 9:55 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Currently if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents but
after a power failure we replay them and then immediately drop them.
This behaviour happens since about 2009, commit c71bf099abdd ("Btrfs:
Avoid orphan inodes cleanup while replaying log"), because it marks
the inode as an orphan instead of dropping any extents beyond i_size
before replaying logged extents, so after the log replay, and while
the mount operation is still ongoing, we find the inode marked as an
orphan and then perform a truncation (drop extents beyond the inode's
i_size). Because the processing of orphan inodes is still done
right after replaying the log and before the mount operation finishes,
the intention of that commit does not make any sense (at least as
of today). However reverting that behaviour is not enough, because
we can not simply discard all extents beyond i_size and then replay
logged extents, because we risk dropping extents beyond i_size created
in past transactions, for example:

  add prealloc extent beyond i_size
  fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
  transaction commit
  add another prealloc extent beyond i_size
  fsync - triggers the fast fsync path
  power failure

In that scenario, we would drop the first extent and then replay the
second one. To fix this just make sure that all prealloc extents
beyond i_size are logged, and if we find too many (which is far from
a common case), fallback to a full transaction commit (like we do when
logging regular extents in the fast fsync path).

Trivial reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt
 $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
 $ sync
 $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
 $ xfs_io -c "fsync" /mnt/foo
 <power failure>

 # mount to replay log
 $ mount /dev/sdb /mnt
 # at this point the file only has one extent, at offset 0, size 256K

A test case for fstests follows soon, covering multiple scenarios that
involve adding prealloc extents with previous shrinking truncates and
without such truncates.

Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 58 insertions(+), 5 deletions(-)

Comments

David Sterba April 6, 2018, 1:38 p.m. UTC | #1
On Thu, Apr 05, 2018 at 10:55:12PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Currently if we allocate extents beyond an inode's i_size (through the
> fallocate system call) and then fsync the file, we log the extents but
> after a power failure we replay them and then immediately drop them.
> This behaviour happens since about 2009, commit c71bf099abdd ("Btrfs:
> Avoid orphan inodes cleanup while replaying log"), because it marks
> the inode as an orphan instead of dropping any extents beyond i_size
> before replaying logged extents, so after the log replay, and while
> the mount operation is still ongoing, we find the inode marked as an
> orphan and then perform a truncation (drop extents beyond the inode's
> i_size). Because the processing of orphan inodes is still done
> right after replaying the log and before the mount operation finishes,
> the intention of that commit does not make any sense (at least as
> of today). However reverting that behaviour is not enough, because
> we can not simply discard all extents beyond i_size and then replay
> logged extents, because we risk dropping extents beyond i_size created
> in past transactions, for example:
> 
>   add prealloc extent beyond i_size
>   fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
>   transaction commit
>   add another prealloc extent beyond i_size
>   fsync - triggers the fast fsync path
>   power failure
> 
> In that scenario, we would drop the first extent and then replay the
> second one. To fix this just make sure that all prealloc extents
> beyond i_size are logged, and if we find too many (which is far from
> a common case), fallback to a full transaction commit (like we do when
> logging regular extents in the fast fsync path).
> 
> Trivial reproducer:
> 
>  $ mkfs.btrfs -f /dev/sdb
>  $ mount /dev/sdb /mnt
>  $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
>  $ sync
>  $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
>  $ xfs_io -c "fsync" /mnt/foo
>  <power failure>
> 
>  # mount to replay log
>  $ mount /dev/sdb /mnt
>  # at this point the file only has one extent, at offset 0, size 256K
> 
> A test case for fstests follows soon, covering multiple scenarios that
> involve adding prealloc extents with previous shrinking truncates and
> without such truncates.
> 
> Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba April 12, 2018, 2:50 p.m. UTC | #2
On Tue, Apr 10, 2018 at 01:50:21PM +0000, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: c71bf099abdd Btrfs: Avoid orphan inodes cleanup while replaying log.
> 
> The bot has also determined it's probably a bug fixing patch. (score: 6.2138)
> 
> The bot has tested the following trees: v4.16.1, v4.15.16, v4.14.33, v4.9.93, v4.4.127.
> 
> v4.16.1: Build OK!
> v4.15.16: Build OK!
> v4.14.33: Build OK!
> v4.9.93: Build failed! Errors:
>     tree-log.c:2367:24: error: ‘struct btrfs_fs_info’ has no member named ‘sectorsize’
>     tree-log.c:2367:24: error: ‘struct btrfs_fs_info’ has no member named ‘sectorsize’
>     tree-log.c:4224:13: error: ‘struct inode’ has no member named ‘flags’; did you mean ‘i_flags’?
>     tree-log.c:4229:38: error: ‘struct inode’ has no member named ‘vfs_inode’
>     tree-log.c:4239:4: error: implicit declaration of function ‘refcount_inc’; did you mean ‘i_readcount_inc’? [-Werror=implicit-function-declaration]

4.9 and older would need manual fixes to the patch. There are a few
refactorings but that should be easy to resolve if somebody wants to
backport port that.

> 
> v4.4.127: Failed to apply! Possible dependencies:
>     0132761017e0 ("btrfs: fix string and comment grammatical issues and typos")
> 
> 
> --
> Thanks,
> Sasha????{.n?+???????+%?????ݶ??w??{.n?+????{???k~????^n?r???z???h?????&???z??z?ޗ?+??+zf???h???~????i?????????z_???j:+v???)ߣ?m
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 70afd1085033..eb3a41269b0e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2457,13 +2457,41 @@  static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb,
 			if (ret)
 				break;
 
-			/* for regular files, make sure corresponding
-			 * orphan item exist. extents past the new EOF
-			 * will be truncated later by orphan cleanup.
+			/*
+			 * Before replaying extents, truncate the inode to its
+			 * size. We need to do it now and not after log replay
+			 * because before an fsync we can have prealloc extents
+			 * added beyond the inode's i_size. If we did it after,
+			 * through orphan cleanup for example, we would drop
+			 * those prealloc extents just after replaying them.
 			 */
 			if (S_ISREG(mode)) {
-				ret = insert_orphan_item(wc->trans, root,
-							 key.objectid);
+				struct inode *inode;
+				u64 from;
+
+				inode = read_one_inode(root, key.objectid);
+				if (!inode) {
+					ret = -EIO;
+					break;
+				}
+				from = ALIGN(i_size_read(inode),
+					     root->fs_info->sectorsize);
+				ret = btrfs_drop_extents(wc->trans, root, inode,
+							 from, (u64)-1, 1);
+				/*
+				 * If the nlink count is zero here, the iput
+				 * will free the inode.  We bump it to make
+				 * sure it doesn't get freed until the link
+				 * count fixup is done.
+				 */
+				if (!ret) {
+					if (inode->i_nlink == 0)
+						inc_nlink(inode);
+					/* Update link count and nbytes. */
+					ret = btrfs_update_inode(wc->trans,
+								 root, inode);
+				}
+				iput(inode);
 				if (ret)
 					break;
 			}
@@ -4348,6 +4376,31 @@  static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 		num++;
 	}
 
+	/*
+	 * Add all prealloc extents beyond the inode's i_size to make sure we
+	 * don't lose them after doing a fast fsync and replaying the log.
+	 */
+	if (inode->flags & BTRFS_INODE_PREALLOC) {
+		struct rb_node *node;
+
+		for (node = rb_last(&tree->map); node; node = rb_prev(node)) {
+			em = rb_entry(node, struct extent_map, rb_node);
+			if (em->start < i_size_read(&inode->vfs_inode))
+				break;
+			if (!list_empty(&em->list))
+				continue;
+			/* Same as above loop. */
+			if (++num > 32768) {
+				list_del_init(&tree->modified_extents);
+				ret = -EFBIG;
+				goto process;
+			}
+			refcount_inc(&em->refs);
+			set_bit(EXTENT_FLAG_LOGGING, &em->flags);
+			list_add_tail(&em->list, &extents);
+		}
+	}
+
 	list_sort(NULL, &extents, extent_cmp);
 	btrfs_get_logged_extents(inode, logged_list, logged_start, logged_end);
 	/*