Message ID | 20200115132135.23994-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Btrfs: fix infinite loop during fsync after rename operations | expand |
On 1/15/20 8:21 AM, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Recently fsstress (from fstests) sporadically started to trigger an > infinite loop during fsync operations. This turned out to be because > support for the rename exchange and whiteout operations was added to > fsstress in fstests. These operations, unlike any others in fsstress, > cause file names to be reused, whence triggering this issue. However > it's not necessary to use rename exchange and rename whiteout operations > trigger this issue, simple rename operations and file creations are > enough to trigger the issue. > > The issue boils down to when we are logging inodes that conflict (that > had the name of any inode we need to log during the fsync operation), > we keep logging them even if they were already logged before, and after > that we check if there's any other inode that conflicts with them and > then add it again to the list of inodes to log. Skipping already logged > inodes fixes the issue. > > Consider the following example: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > > $ mkdir /mnt/testdir # inode 257 > > $ touch /mnt/testdir/zz # inode 258 > $ ln /mnt/testdir/zz /mnt/testdir/zz_link > > $ touch /mnt/testdir/a # inode 259 > > $ sync > > # The following 3 renames achieve the same result as a rename exchange > # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a). > > $ mv /mnt/testdir/a /mnt/testdir/a/tmp > $ mv /mnt/testdir/zz_link /mnt/testdir/a > $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link > > # The following rename and file creation give the same result as a > # rename whiteout operation (<rename_whiteout> zz to a2). > > $ mv /mnt/testdir/zz /mnt/testdir/a2 > $ touch /mnt/testdir/zz # inode 260 > > $ xfs_io -c fsync /mnt/testdir/zz > --> results in the infinite loop > > The following steps happen: > > 1) When logging inode 260, we find that its reference named "zz" was > used by inode 258 in the previous transaction (through the commit > root), so inode 258 is added to the list of conflicting indoes that > need to be logged; > > 2) After logging inode 258, we find that its reference named "a" was > used by inode 259 in the previous transaction, and therefore we add > inode 259 to the list of conflicting inodes to be logged; > > 3) After logging inode 259, we find that its reference named "zz_link" > was used by inode 258 in the previous transaction - we add inode 258 > to the list of conflicting inodes to log, again - we had already > logged it before at step 3. After logging it again, we find again > that inode 259 conflicts with him, and we add again 259 to the list, > etc - we end up repeating all the previous steps. > > So fix this by skipping logging of conflicting inodes that were already > logged. > > Fixes: 6b5fc433a7ad67 ("Btrfs: fix fsync after succession of renames of different files") > CC: stable@vger.kernel.org > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Wed, Jan 15, 2020 at 01:21:35PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Recently fsstress (from fstests) sporadically started to trigger an > infinite loop during fsync operations. This turned out to be because > support for the rename exchange and whiteout operations was added to > fsstress in fstests. These operations, unlike any others in fsstress, > cause file names to be reused, whence triggering this issue. However > it's not necessary to use rename exchange and rename whiteout operations > trigger this issue, simple rename operations and file creations are > enough to trigger the issue. > > The issue boils down to when we are logging inodes that conflict (that > had the name of any inode we need to log during the fsync operation), > we keep logging them even if they were already logged before, and after > that we check if there's any other inode that conflicts with them and > then add it again to the list of inodes to log. Skipping already logged > inodes fixes the issue. > > Consider the following example: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > > $ mkdir /mnt/testdir # inode 257 > > $ touch /mnt/testdir/zz # inode 258 > $ ln /mnt/testdir/zz /mnt/testdir/zz_link > > $ touch /mnt/testdir/a # inode 259 > > $ sync > > # The following 3 renames achieve the same result as a rename exchange > # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a). > > $ mv /mnt/testdir/a /mnt/testdir/a/tmp > $ mv /mnt/testdir/zz_link /mnt/testdir/a > $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link > > # The following rename and file creation give the same result as a > # rename whiteout operation (<rename_whiteout> zz to a2). > > $ mv /mnt/testdir/zz /mnt/testdir/a2 > $ touch /mnt/testdir/zz # inode 260 > > $ xfs_io -c fsync /mnt/testdir/zz > --> results in the infinite loop > > The following steps happen: > > 1) When logging inode 260, we find that its reference named "zz" was > used by inode 258 in the previous transaction (through the commit > root), so inode 258 is added to the list of conflicting indoes that > need to be logged; > > 2) After logging inode 258, we find that its reference named "a" was > used by inode 259 in the previous transaction, and therefore we add > inode 259 to the list of conflicting inodes to be logged; > > 3) After logging inode 259, we find that its reference named "zz_link" > was used by inode 258 in the previous transaction - we add inode 258 > to the list of conflicting inodes to log, again - we had already > logged it before at step 3. After logging it again, we find again > that inode 259 conflicts with him, and we add again 259 to the list, > etc - we end up repeating all the previous steps. > > So fix this by skipping logging of conflicting inodes that were already > logged. > > Fixes: 6b5fc433a7ad67 ("Btrfs: fix fsync after succession of renames of different files") > CC: stable@vger.kernel.org > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to misc-next, thanks.
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index f674013f4771..2b23e557c457 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4836,6 +4836,50 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, continue; } /* + * If the inode was already logged skip it - otherwise we can + * hit an infinite loop. Example: + * + * From the commit root (previous transaction) we have the + * following inodes: + * + * inode 257 a directory + * inode 258 with references "zz" and "zz_link" on inode 257 + * inode 259 with reference "a" on inode 257 + * + * And in the current (uncommitted) transaction we have: + * + * inode 257 a directory, unchanged + * inode 258 with references "a" and "a2" on inode 257 + * inode 259 with reference "zz_link" on inode 257 + * inode 261 with reference "zz" on inode 257 + * + * When logging inode 261 the following infinite loop could + * happen if we don't skip already logged inodes: + * + * - we detect inode 258 as a conflicting inode, with inode 261 + * on reference "zz", and log it; + * + * - we detect inode 259 as a conflicting inode, with inode 258 + * on reference "a", and log it; + * + * - we detect inode 258 as a conflicting inode, with inode 259 + * on reference "zz_link", and log it - again! After this we + * repeat the above steps forever. + */ + spin_lock(&BTRFS_I(inode)->lock); + /* + * Check the inode's logged_trans only instead of + * btrfs_inode_in_log(). This is because the last_log_commit of + * the inode is not updated when we only log that it exists and + * and it has the full sync bit set (see btrfs_log_inode()). + */ + if (BTRFS_I(inode)->logged_trans == trans->transid) { + spin_unlock(&BTRFS_I(inode)->lock); + btrfs_add_delayed_iput(inode); + continue; + } + spin_unlock(&BTRFS_I(inode)->lock); + /* * We are safe logging the other inode without acquiring its * lock as long as we log with the LOG_INODE_EXISTS mode. We * are safe against concurrent renames of the other inode as