Message ID | 20190304140612.23950-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: fix incorrect file size after shrinking truncate and fsync | expand |
On Mon, Mar 04, 2019 at 02:06:12PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > If we do a shrinking truncate against an inode which is already present > in the respective log tree and then rename it, as part of logging the new > name we end up logging an inode item that reflects the old size of the > file (the one which we previously logged) and not the new smaller size. > The decision to preserve the size previously logged was added by commit > 1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to > inode") in order to avoid data loss after replaying the log. However that > decision is only needed for the case the logged inode size is smaller then > the current size of the inode, as explained in that commit's change log. > If the current size of the inode is smaller then the previously logged > size, we know a shrinking truncate happened and therefore need to use > that smaller size. > > Example to trigger the problem: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > > $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo > $ xfs_io -c "fsync" /mnt/foo > $ xfs_io -c "truncate 3000" /mnt/foo > > $ mv /mnt/foo /mnt/bar > $ xfs_io -c "fsync" /mnt/bar > > <power failure> > > $ mount /dev/sdb /mnt > $ od -t x1 -A d /mnt/bar > 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab > * > 0008000 > > Once we rename the file, we log its name (and inode item), and because > the inode was already logged before in the current transaction, we log it > with a size of 8000 bytes because that is the size we previously logged > (with the first fsync). As part of the rename, besides logging the inode, > we do also sync the log, which is done since commit d4682ba03ef618 > ("Btrfs: sync log after logging new name"), so the next fsync against our > inode is effectively a no-op, since no new changes happened since the > rename operation. Even if did not sync the log during the rename > operation, the same problem (fize size of 8000 bytes instead of 3000 > bytes) would be visible after replaying the log if the log ended up > getting synced to disk through some other means, such as for example by > fsyncing some other modified file. In the example above the fsync after > the rename operation is there just because not every filesystem may > guarantee logging/journalling the inode (and syncing the log/journal) > during the rename operation, for example it is needed for f2fs, but not > for ext4 and xfs. > > Fix this scenario by, when logging a new name (which is triggered by > rename and link operations), using the current size of the inode instead > of the previously logged inode size. > > A test case for fstests follows soon. > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695 > Reported-by: Seulbae Kim <seulbae@gatech.edu> > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to misc-next, thanks.
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 81a357a32837..41b66ea4060c 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4549,6 +4549,19 @@ static int logged_inode_size(struct btrfs_root *log, struct btrfs_inode *inode, item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_item); *size_ret = btrfs_inode_size(path->nodes[0], item); + /* + * If the in-memory inode's i_size is smaller then the inode + * size stored in the btree, return the inode's i_size, so + * that we get a correct inode size after replaying the log + * when before a power failure we had a shrinking truncate + * followed by addition of a new name (rename / new hard link). + * Otherwise return the inode size from the btree, to avoid + * data loss when replaying a log due to previously doing a + * write that expands the inode's size and logging a new name + * immediately after. + */ + if (*size_ret > inode->vfs_inode.i_size) + *size_ret = inode->vfs_inode.i_size; } btrfs_release_path(path);