Message ID | 20190607102524.32655-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] Btrfs: fix data loss after inode eviction, renaming it, and fsync it | expand |
On Fri, Jun 07, 2019 at 11:25:24AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > When we log an inode, regardless of logging it completely or only that it > exists, we always update it as logged (logged_trans and last_log_commit > fields of the inode are updated). This is generally fine and avoids future > attempts to log it from having to do repeated work that brings no value. > > However, if we write data to a file, then evict its inode after all the > dealloc was flushed (and ordered extents completed), rename the file and > fsync it, we end up not logging the new extents, since the rename may > result in logging that the inode exists in case the parent directory was > logged before. The following reproducer shows and explains how this can > happen: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > > $ mkdir /mnt/dir > $ touch /mnt/dir/foo > $ touch /mnt/dir/bar > > # Do a direct IO write instead of a buffered write because with a > # buffered write we would need to make sure dealloc gets flushed and > # complete before we do the inode eviction later, and we can not do that > # from user space with call to things such as sync(2) since that results > # in a transaction commit as well. > $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar > > # Keep the directory dir in use while we evict inodes. We want our file > # bar's inode to be evicted but we don't want our directory's inode to > # be evicted (if it were evicted too, we would not be able to reproduce > # the issue since the first fsync below, of file foo, would result in a > # transaction commit. > $ ( cd /mnt/dir; while true; do :; done ) & > $ pid=$! > > # Wait a bit to give time for the background process to chdir. > $ sleep 0.1 > > # Evict all inodes, except the inode for the directory dir because it is > # currently in use by our background process. > $ echo 2 > /proc/sys/vm/drop_caches > > # fsync file foo, which ends up persisting information about the parent > # directory because it is a new inode. > $ xfs_io -c fsync /mnt/dir/foo > > # Rename bar, this results in logging that this inode exists (inode item, > # names, xattrs) because the parent directory is in the log. > $ mv /mnt/dir/bar /mnt/dir/baz > > # Now fsync baz, which ends up doing absolutely nothing because of the > # rename operation which logged that the inode exists only. > $ xfs_io -c fsync /mnt/dir/baz > > <power failure> > > $ mount /dev/sdb /mnt > $ od -t x1 -A d /mnt/dir/baz > 0000000 > > --> Empty file, data we wrote is missing. > > Fix this by not updating last_sub_trans of an inode when we are logging > only that it exists and the inode was not yet logged since it was loaded > from disk (full_sync bit set), this is enough to make btrfs_inode_in_log() > return false for this scenario and make us log the inode. The logged_trans > of the inode is still always setsince that alone is used to track if names > need to be deleted as part of unlink operations. > > Fixes: 257c62e1bce03e ("Btrfs: avoid tree log commit when there are no changes") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to misc-next, thanks.
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 83755d3b96e3..40c839161596 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5407,9 +5407,19 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, } } + /* + * Don't update last_log_commit if we logged that an inode exists after + * it was loaded to memory (full_sync bit set). + * This is to prevent data loss when we do a write to the inode, then + * the inode gets evicted after all delalloc was flushed, then we log + * it exists (due to a rename for example) and then fsync it. This last + * fsync would do nothing (not logging the extents previously written). + */ spin_lock(&inode->lock); inode->logged_trans = trans->transid; - inode->last_log_commit = inode->last_sub_trans; + if (inode_only != LOG_INODE_EXISTS || + !test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags)) + inode->last_log_commit = inode->last_sub_trans; spin_unlock(&inode->lock); out_unlock: mutex_unlock(&inode->log_mutex);