Message ID | 20181128145428.22944-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] Btrfs: fix fsync of files with multiple hard links in new directories | expand |
On Wed, Nov 28, 2018 at 02:54:28PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > The log tree has a long standing problem that when a file is fsync'ed we > only check for new ancestors, created in the current transaction, by > following only the hard link for which the fsync was issued. We follow the > ancestors using the VFS' dget_parent() API. This means that if we create a > new link for a file in a directory that is new (or in an any other new > ancestor directory) and then fsync the file using an old hard link, we end > up not logging the new ancestor, and on log replay that new hard link and > ancestor do not exist. In some cases, involving renames, the file will not > exist at all. > > Example: > > mkfs.btrfs -f /dev/sdb > mount /dev/sdb /mnt > > mkdir /mnt/A > touch /mnt/foo > ln /mnt/foo /mnt/A/bar > xfs_io -c fsync /mnt/foo > > <power failure> > > In this example after log replay only the hard link named 'foo' exists > and directory A does not exist, which is unexpected. In other major linux > filesystems, such as ext4, xfs and f2fs for example, both hard links exist > and so does directory A after mounting again the filesystem. > > Checking if any new ancestors are new and need to be logged was added in > 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"), > however only for the ancestors of the hard link (dentry) for which the > fsync was issued, instead of checking for all ancestors for all of the > inode's hard links. > > So fix this by tracking the id of the last transaction where a hard link > was created for an inode and then on fsync fallback to a full transaction > commit when an inode has more than one hard link and at least one new hard > link was created in the current transaction. This is the simplest solution > since this is not a common use case (adding frequently hard links for > which there's an ancestor created in the current transaction and then > fsync the file). In case it ever becomes a common use case, a solution > that consists of iterating the fs/subvol btree for each hard link and > check if any ancestor is new, could be implemented. > > This solves many unexpected scenarios reported by Jayashree Mohan and > Vijay Chidambaram, and for which there is a new test case for fstests > under review. > > Reported-by: Vijay Chidambaram <vvijay03@gmail.com> > Reported-by: Jayashree Mohan <jayashree2912@gmail.com> > Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to misc-next, thanks.
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 1343ac57b438..7177d1d33584 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -147,6 +147,12 @@ struct btrfs_inode { u64 last_unlink_trans; /* + * Track the transaction id of the last transaction used to create a + * hard link for the inode. This is used by the log tree (fsync). + */ + u64 last_link_trans; + + /* * Number of bytes outstanding that are going to need csums. This is * used in ENOSPC accounting. */ diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 64ea749c1ba4..51f4628be2d2 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3685,6 +3685,21 @@ static int btrfs_read_locked_inode(struct inode *inode, * inode is not a directory, logging its parent unnecessarily. */ BTRFS_I(inode)->last_unlink_trans = BTRFS_I(inode)->last_trans; + /* + * Similar reasoning for last_link_trans, needs to be set otherwise + * for a case like the following: + * + * mkdir A + * touch foo + * ln foo A/bar + * echo 2 > /proc/sys/vm/drop_caches + * fsync foo + * <power failure> + * + * Would result in link bar and directory A not existing after the power + * failure. + */ + BTRFS_I(inode)->last_link_trans = BTRFS_I(inode)->last_trans; path->slots[0]++; if (inode->i_nlink != 1 || @@ -6651,6 +6666,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir, if (err) goto fail; } + BTRFS_I(inode)->last_link_trans = trans->transid; d_instantiate(dentry, inode); ret = btrfs_log_new_name(trans, BTRFS_I(inode), NULL, parent, true, NULL); @@ -9179,6 +9195,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) ei->index_cnt = (u64)-1; ei->dir_index = 0; ei->last_unlink_trans = 0; + ei->last_link_trans = 0; ei->last_log_commit = 0; spin_lock_init(&ei->lock); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index aac3749f697f..896d79144052 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5760,6 +5760,22 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, goto end_trans; } + /* + * If a new hard link was added to the inode in the current transaction + * and its link count is now greater than 1, we need to fallback to a + * transaction commit, otherwise we can end up not logging all its new + * parents for all the hard links. Here just from the dentry used to + * fsync, we can not visit the ancestor inodes for all the other hard + * links to figure out if any is new, so we fallback to a transaction + * commit (instead of adding a lot of complexity of scanning a btree, + * since this scenario is not a common use case). + */ + if (inode->vfs_inode.i_nlink > 1 && + inode->last_link_trans > last_committed) { + ret = -EMLINK; + goto end_trans; + } + while (1) { if (!parent || d_really_is_negative(parent) || sb != parent->d_sb) break;