diff mbox series

Btrfs: fix incorrect file size after shrinking truncate and fsync

Message ID 20190304140612.23950-1-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show
Series Btrfs: fix incorrect file size after shrinking truncate and fsync | expand

Commit Message

Filipe Manana March 4, 2019, 2:06 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

If we do a shrinking truncate against an inode which is already present
in the respective log tree and then rename it, as part of logging the new
name we end up logging an inode item that reflects the old size of the
file (the one which we previously logged) and not the new smaller size.
The decision to preserve the size previously logged was added by commit
1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to
inode") in order to avoid data loss after replaying the log. However that
decision is only needed for the case the logged inode size is smaller then
the current size of the inode, as explained in that commit's change log.
If the current size of the inode is smaller then the previously logged
size, we know a shrinking truncate happened and therefore need to use
that smaller size.

Example to trigger the problem:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
  $ xfs_io -c "fsync" /mnt/foo
  $ xfs_io -c "truncate 3000" /mnt/foo

  $ mv /mnt/foo /mnt/bar
  $ xfs_io -c "fsync" /mnt/bar

  <power failure>

  $ mount /dev/sdb /mnt
  $ od -t x1 -A d /mnt/bar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  0008000

Once we rename the file, we log its name (and inode item), and because
the inode was already logged before in the current transaction, we log it
with a size of 8000 bytes because that is the size we previously logged
(with the first fsync). As part of the rename, besides logging the inode,
we do also sync the log, which is done since commit d4682ba03ef618
("Btrfs: sync log after logging new name"), so the next fsync against our
inode is effectively a no-op, since no new changes happened since the
rename operation. Even if did not sync the log during the rename
operation, the same problem (fize size of 8000 bytes instead of 3000
bytes) would be visible after replaying the log if the log ended up
getting synced to disk through some other means, such as for example by
fsyncing some other modified file. In the example above the fsync after
the rename operation is there just because not every filesystem may
guarantee logging/journalling the inode (and syncing the log/journal)
during the rename operation, for example it is needed for f2fs, but not
for ext4 and xfs.

Fix this scenario by, when logging a new name (which is triggered by
rename and link operations), using the current size of the inode instead
of the previously logged inode size.

A test case for fstests follows soon.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
Reported-by: Seulbae Kim <seulbae@gatech.edu>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

David Sterba March 8, 2019, 12:45 p.m. UTC | #1
On Mon, Mar 04, 2019 at 02:06:12PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we do a shrinking truncate against an inode which is already present
> in the respective log tree and then rename it, as part of logging the new
> name we end up logging an inode item that reflects the old size of the
> file (the one which we previously logged) and not the new smaller size.
> The decision to preserve the size previously logged was added by commit
> 1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to
> inode") in order to avoid data loss after replaying the log. However that
> decision is only needed for the case the logged inode size is smaller then
> the current size of the inode, as explained in that commit's change log.
> If the current size of the inode is smaller then the previously logged
> size, we know a shrinking truncate happened and therefore need to use
> that smaller size.
> 
> Example to trigger the problem:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
>   $ xfs_io -c "fsync" /mnt/foo
>   $ xfs_io -c "truncate 3000" /mnt/foo
> 
>   $ mv /mnt/foo /mnt/bar
>   $ xfs_io -c "fsync" /mnt/bar
> 
>   <power failure>
> 
>   $ mount /dev/sdb /mnt
>   $ od -t x1 -A d /mnt/bar
>   0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
>   *
>   0008000
> 
> Once we rename the file, we log its name (and inode item), and because
> the inode was already logged before in the current transaction, we log it
> with a size of 8000 bytes because that is the size we previously logged
> (with the first fsync). As part of the rename, besides logging the inode,
> we do also sync the log, which is done since commit d4682ba03ef618
> ("Btrfs: sync log after logging new name"), so the next fsync against our
> inode is effectively a no-op, since no new changes happened since the
> rename operation. Even if did not sync the log during the rename
> operation, the same problem (fize size of 8000 bytes instead of 3000
> bytes) would be visible after replaying the log if the log ended up
> getting synced to disk through some other means, such as for example by
> fsyncing some other modified file. In the example above the fsync after
> the rename operation is there just because not every filesystem may
> guarantee logging/journalling the inode (and syncing the log/journal)
> during the rename operation, for example it is needed for f2fs, but not
> for ext4 and xfs.
> 
> Fix this scenario by, when logging a new name (which is triggered by
> rename and link operations), using the current size of the inode instead
> of the previously logged inode size.
> 
> A test case for fstests follows soon.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
> Reported-by: Seulbae Kim <seulbae@gatech.edu>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks.
diff mbox series

Patch

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 81a357a32837..41b66ea4060c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4549,6 +4549,19 @@  static int logged_inode_size(struct btrfs_root *log, struct btrfs_inode *inode,
 		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
 				      struct btrfs_inode_item);
 		*size_ret = btrfs_inode_size(path->nodes[0], item);
+		/*
+		 * If the in-memory inode's i_size is smaller then the inode
+		 * size stored in the btree, return the inode's i_size, so
+		 * that we get a correct inode size after replaying the log
+		 * when before a power failure we had a shrinking truncate
+		 * followed by addition of a new name (rename / new hard link).
+		 * Otherwise return the inode size from the btree, to avoid
+		 * data loss when replaying a log due to previously doing a
+		 * write that expands the inode's size and logging a new name
+		 * immediately after.
+		 */
+		if (*size_ret > inode->vfs_inode.i_size)
+			*size_ret = inode->vfs_inode.i_size;
 	}
 
 	btrfs_release_path(path);