[6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()

Message ID 77b21c64a5aed56e5602c59558c1b09254f3b494.1611742865.git.fdmanana@suse.com
State New, archived
Series btrfs: more performance improvements for dbench workloads

Commit Message

Filipe Manana Jan. 27, 2021, 10:34 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Whenever we fsync an inode, if it is a directory, a regular file that was
created in the current transaction or has last_unlink_trans set to the
generation of the current transaction, we check if any of its ancestor
inodes (and the inode itself if it is a directory) cannot be logged and
needs a fallback to a full transaction commit - if so, we return with a
value of 1 in order to fall back to a transaction commit.

However, we often do not need to fall back to a transaction commit because:

1) The ancestor inode is not an immediate parent, and therefore there is
   no explicit request to log it, and it is needed neither to guarantee
   the consistency of the inode originally asked to be logged (fsynced)
   nor that of its immediate parent;

2) The ancestor inode was already logged before, in which case any link,
   unlink or rename operation updates the log as needed.

So for these two cases we can avoid an unnecessary transaction commit.
Therefore remove check_parent_dirs_for_sync() and add a check at the top
of btrfs_log_inode() to make us fall back immediately to a transaction
commit when we are logging a directory inode that cannot be logged and
needs a full transaction commit. All we need to protect against is the
case where, after renaming a file, someone fsyncs only the old directory,
which would result in losing the renamed file after a log replay.
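
To make the condition concrete, the check added at the top of
btrfs_log_inode() can be restated as a small user-space predicate. This is
only an illustrative sketch: struct toy_inode, enum toy_log_mode and
toy_needs_full_commit() are hypothetical names that do not exist in the
kernel, and the real code additionally calls btrfs_set_log_full_commit()
and jumps to out_unlock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's LOG_INODE_* logging modes. */
enum toy_log_mode { TOY_LOG_INODE_ALL, TOY_LOG_INODE_EXISTS };

/* Hypothetical, simplified view of the two inode fields the check reads. */
struct toy_inode {
	bool is_dir;                /* models S_ISDIR(inode->vfs_inode.i_mode) */
	uint64_t last_unlink_trans; /* last transaction with an unlink/rename */
};

/*
 * Returns 1 when the caller should fall back to a full transaction
 * commit, 0 when the inode can be logged normally. Only a directory
 * that is being fully logged, and that saw an unlink or rename in the
 * current transaction, triggers the fallback.
 */
int toy_needs_full_commit(const struct toy_inode *inode,
			  enum toy_log_mode mode, uint64_t transid)
{
	if (inode->is_dir &&
	    mode == TOY_LOG_INODE_ALL &&
	    inode->last_unlink_trans >= transid)
		return 1;

	return 0;
}
```

With a transaction id of 10, a directory whose last_unlink_trans is 10
yields 1 under TOY_LOG_INODE_ALL, while a regular file with the same value,
or the same directory under TOY_LOG_INODE_EXISTS, yields 0.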

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 121 ++++++--------------------------------------
 1 file changed, 15 insertions(+), 106 deletions(-)

Comments

Josef Bacik Jan. 27, 2021, 3:23 p.m. UTC | #1
On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Whenever we fsync an inode, if it is a directory, a regular file that was
> created in the current transaction or has last_unlink_trans set to the
> generation of the current transaction, we check if any of its ancestor
> inodes (and the inode itself if it is a directory) can not be logged and
> need a fallback to a full transaction commit - if so, we return with a
> value of 1 in order to fallback to a transaction commit.
> 
> However we often do not need to fallback to a transaction commit because:
> 
> 1) The ancestor inode is not an immediate parent, and therefore there is
>     not an explicit request to log it and it is not needed neither to
>     guarantee the consistency of the inode originally asked to be logged
>     (fsynced) nor its immediate parent;
> 
> 2) The ancestor inode was already logged before, in which case any link,
>     unlink or rename operation updates the log as needed.
> 
> So for these two cases we can avoid an unnecessary transaction commit.
> Therefore remove check_parent_dirs_for_sync() and add a check at the top
> of btrfs_log_inode() to make us fallback immediately to a transaction
> commit when we are logging a directory inode that can not be logged and
> needs a full transaction commit. All we need to protect is the case where
> after renaming a file someone fsyncs only the old directory, which would
> result in losing the renamed file after a log replay.
> 
> This patch is part of a patchset comprised of the following patches:
> 
>    btrfs: remove unnecessary directory inode item update when deleting dir entry
>    btrfs: stop setting nbytes when filling inode item for logging
>    btrfs: avoid logging new ancestor inodes when logging new inode
>    btrfs: skip logging directories already logged when logging all parents
>    btrfs: skip logging inodes already logged when logging new entries
>    btrfs: remove unnecessary check_parent_dirs_for_sync()
>    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> 
> Performance results, after applying all patches, are mentioned in the
> change log of the last patch.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

I'm having a hard time with this one.

Previously we would commit the transaction if the inode was a regular file, that 
was created in this current transaction, and had been renamed.  Now with this 
patch you're only committing the transaction if we are a directory and were 
renamed ourselves.  Before if you already had directories A and B and then did 
something like

echo "foo" > /mnt/test/A/blah
fsync(/mnt/test/A/blah);
fsync(/mnt/test/A);
mv /mnt/test/A/blah /mnt/test/B
fsync(/mnt/test/B/blah);

we would commit the transaction on this second fsync, but with your patch we are 
not.  I suppose that's keeping in line with how fsync is allowed to work, but 
it's definitely a change in behavior from what we used to do.  Not sure if 
that's good or not, I'll have to think about it.  Thanks,

Josef
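
For readers who want to issue the sequence above from user space, here is
a minimal sketch in C. It assumes a writable base directory that does not
yet contain A or B. The names run_rename_fsync_sequence(), fsync_path()
and make_path() are hypothetical helpers, not part of any btrfs tooling.
The sketch only performs the system calls; it does not simulate a crash or
a log replay, so by itself it cannot show whether a transaction commit
took place.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "base/rest" into buf; fails if the result would be truncated. */
static int make_path(char *buf, size_t len, const char *base, const char *rest)
{
	int n = snprintf(buf, len, "%s/%s", base, rest);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open a file or directory read-only and fsync it. */
static int fsync_path(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = fsync(fd);
	close(fd);
	return ret;
}

/* Returns 0 if every step of the sequence succeeded, -1 otherwise. */
int run_rename_fsync_sequence(const char *base)
{
	char dir_a[4096], dir_b[4096], old_path[4096], new_path[4096];
	int fd;

	if (make_path(dir_a, sizeof(dir_a), base, "A") ||
	    make_path(dir_b, sizeof(dir_b), base, "B") ||
	    make_path(old_path, sizeof(old_path), base, "A/blah") ||
	    make_path(new_path, sizeof(new_path), base, "B/blah"))
		return -1;

	/* Directories A and B exist before the file is created. */
	if (mkdir(dir_a, 0755) || mkdir(dir_b, 0755))
		return -1;

	/* echo "foo" > A/blah; fsync(A/blah) */
	fd = open(old_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, "foo\n", 4) != 4 || fsync(fd)) {
		close(fd);
		return -1;
	}
	close(fd);

	/* fsync(A) */
	if (fsync_path(dir_a))
		return -1;

	/* mv A/blah B */
	if (rename(old_path, new_path))
		return -1;

	/* fsync(B/blah): the second fsync of the file discussed above. */
	return fsync_path(new_path);
}
```

Calling run_rename_fsync_sequence("/mnt/test") on a freshly mounted
filesystem would reproduce the steps in the same order as the example.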
Filipe Manana Jan. 27, 2021, 3:36 p.m. UTC | #2
On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Whenever we fsync an inode, if it is a directory, a regular file that was
> > created in the current transaction or has last_unlink_trans set to the
> > generation of the current transaction, we check if any of its ancestor
> > inodes (and the inode itself if it is a directory) can not be logged and
> > need a fallback to a full transaction commit - if so, we return with a
> > value of 1 in order to fallback to a transaction commit.
> >
> > However we often do not need to fallback to a transaction commit because:
> >
> > 1) The ancestor inode is not an immediate parent, and therefore there is
> >     not an explicit request to log it and it is not needed neither to
> >     guarantee the consistency of the inode originally asked to be logged
> >     (fsynced) nor its immediate parent;
> >
> > 2) The ancestor inode was already logged before, in which case any link,
> >     unlink or rename operation updates the log as needed.
> >
> > So for these two cases we can avoid an unnecessary transaction commit.
> > Therefore remove check_parent_dirs_for_sync() and add a check at the top
> > of btrfs_log_inode() to make us fallback immediately to a transaction
> > commit when we are logging a directory inode that can not be logged and
> > needs a full transaction commit. All we need to protect is the case where
> > after renaming a file someone fsyncs only the old directory, which would
> > result in losing the renamed file after a log replay.
> >
> > This patch is part of a patchset comprised of the following patches:
> >
> >    btrfs: remove unnecessary directory inode item update when deleting dir entry
> >    btrfs: stop setting nbytes when filling inode item for logging
> >    btrfs: avoid logging new ancestor inodes when logging new inode
> >    btrfs: skip logging directories already logged when logging all parents
> >    btrfs: skip logging inodes already logged when logging new entries
> >    btrfs: remove unnecessary check_parent_dirs_for_sync()
> >    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> >
> > Performance results, after applying all patches, are mentioned in the
> > change log of the last patch.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> I'm having a hard time with this one.
>
> Previously we would commit the transaction if the inode was a regular file, that
> was created in this current transaction, and had been renamed.  Now with this
> patch you're only committing the transaction if we are a directory and were
> renamed ourselves.  Before if you already had directories A and B and then did
> something like
>
> echo "foo" > /mnt/test/A/blah
> fsync(/mnt/test/A/blah);
> fsync(/mnt/test/A);
> mv /mnt/test/A/blah /mnt/test/B
> fsync(/mnt/test/B/blah);
>
> we would commit the transaction on this second fsync, but with your patch we are
> not.  I suppose that's keeping in line with how fsync is allowed to work, but
> it's definitely a change in behavior from what we used to do.  Not sure if
> that's good or not, I'll have to think about it.  Thanks,

Yes. Because of the rename (or a link), we will set last_unlink_trans
to the current transaction, and when logging the file that will cause
logging of all its old parents (A). That was added several years ago
to fix corruptions, and it turned out to be needed later as well to
ensure we have a behaviour similar to xfs and ext4 (and others)
regarding strictly ordered metadata updates (I added several tests to
fstests over the years for all the cases).
There's also the fact that on replay we will delete any inode refs
that aren't in the log (that one was added in commit 1f250e929a9c
("Btrfs: fix log replay failure after unlink and link combination")).

For that example we also have A updated in the log by the rename. So
we know the log is consistent.

So that's why the whole check_parent_dirs_for_sync() is not needed.

Thanks.

>
> Josef
Josef Bacik Jan. 27, 2021, 3:42 p.m. UTC | #3
On 1/27/21 10:36 AM, Filipe Manana wrote:
> On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> Whenever we fsync an inode, if it is a directory, a regular file that was
>>> created in the current transaction or has last_unlink_trans set to the
>>> generation of the current transaction, we check if any of its ancestor
>>> inodes (and the inode itself if it is a directory) can not be logged and
>>> need a fallback to a full transaction commit - if so, we return with a
>>> value of 1 in order to fallback to a transaction commit.
>>>
>>> However we often do not need to fallback to a transaction commit because:
>>>
>>> 1) The ancestor inode is not an immediate parent, and therefore there is
>>>      not an explicit request to log it and it is not needed neither to
>>>      guarantee the consistency of the inode originally asked to be logged
>>>      (fsynced) nor its immediate parent;
>>>
>>> 2) The ancestor inode was already logged before, in which case any link,
>>>      unlink or rename operation updates the log as needed.
>>>
>>> So for these two cases we can avoid an unnecessary transaction commit.
>>> Therefore remove check_parent_dirs_for_sync() and add a check at the top
>>> of btrfs_log_inode() to make us fallback immediately to a transaction
>>> commit when we are logging a directory inode that can not be logged and
>>> needs a full transaction commit. All we need to protect is the case where
>>> after renaming a file someone fsyncs only the old directory, which would
>>> result in losing the renamed file after a log replay.
>>>
>>> This patch is part of a patchset comprised of the following patches:
>>>
>>>     btrfs: remove unnecessary directory inode item update when deleting dir entry
>>>     btrfs: stop setting nbytes when filling inode item for logging
>>>     btrfs: avoid logging new ancestor inodes when logging new inode
>>>     btrfs: skip logging directories already logged when logging all parents
>>>     btrfs: skip logging inodes already logged when logging new entries
>>>     btrfs: remove unnecessary check_parent_dirs_for_sync()
>>>     btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
>>>
>>> Performance results, after applying all patches, are mentioned in the
>>> change log of the last patch.
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>
>> I'm having a hard time with this one.
>>
>> Previously we would commit the transaction if the inode was a regular file, that
>> was created in this current transaction, and had been renamed.  Now with this
>> patch you're only committing the transaction if we are a directory and were
>> renamed ourselves.  Before if you already had directories A and B and then did
>> something like
>>
>> echo "foo" > /mnt/test/A/blah
>> fsync(/mnt/test/A/blah);
>> fsync(/mnt/test/A);
>> mv /mnt/test/A/blah /mnt/test/B
>> fsync(/mnt/test/B/blah);
>>
>> we would commit the transaction on this second fsync, but with your patch we are
>> not.  I suppose that's keeping in line with how fsync is allowed to work, but
>> it's definitely a change in behavior from what we used to do.  Not sure if
>> that's good or not, I'll have to think about it.  Thanks,
> 
> Yes. Because of the rename (or a link), we will set last_unlink_trans
> to the current transaction, and when logging the file that will cause
> logging of all its old parents (A). That was added several years ago
> to fix corruptions, and it turned out to be needed later as well to
> ensure we have a behaviour similar to xfs and ext4 (and others)
> regarding strictly ordered metadata updates (I added several tests to
> fstests over the years for all the cases).
> There's also the fact that on replay we will delete any inode refs
> that aren't in the log (that one was added in commit 1f250e929a9c
> ("Btrfs: fix log replay failure after unlink and link combination")).
> 
> For that example we also have A updated in the log by the rename. So
> we know the log is consistent.
> 
> So that's why the whole check_parent_dirs_for_sync() is not needed.
> 

Ok that's reasonable, thanks,

Josef

Patch

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6dc376a16cf2..4c7b283ed2b2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5265,6 +5265,21 @@  static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 		mutex_lock(&inode->log_mutex);
 	}
 
+	/*
+	 * This is for cases where logging a directory could result in losing a
+	 * file after replaying the log. For example, if we move a file from a
+	 * directory A to a directory B, then fsync directory A, we have no way
+	 * to know the file was moved from A to B, so logging just A would
+	 * result in losing the file after a log replay.
+	 */
+	if (S_ISDIR(inode->vfs_inode.i_mode) &&
+	    inode_only == LOG_INODE_ALL &&
+	    inode->last_unlink_trans >= trans->transid) {
+		btrfs_set_log_full_commit(trans);
+		err = 1;
+		goto out_unlock;
+	}
+
 	/*
 	 * a brute force approach to making sure we get the most uptodate
 	 * copies of everything.
@@ -5428,99 +5443,6 @@  static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	return err;
 }
 
-/*
- * Check if we must fallback to a transaction commit when logging an inode.
- * This must be called after logging the inode and is used only in the context
- * when fsyncing an inode requires the need to log some other inode - in which
- * case we can't lock the i_mutex of each other inode we need to log as that
- * can lead to deadlocks with concurrent fsync against other inodes (as we can
- * log inodes up or down in the hierarchy) or rename operations for example. So
- * we take the log_mutex of the inode after we have logged it and then check for
- * its last_unlink_trans value - this is safe because any task setting
- * last_unlink_trans must take the log_mutex and it must do this before it does
- * the actual unlink operation, so if we do this check before a concurrent task
- * sets last_unlink_trans it means we've logged a consistent version/state of
- * all the inode items, otherwise we are not sure and must do a transaction
- * commit (the concurrent task might have only updated last_unlink_trans before
- * we logged the inode or it might have also done the unlink).
- */
-static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans,
-					  struct btrfs_inode *inode)
-{
-	bool ret = false;
-
-	mutex_lock(&inode->log_mutex);
-	if (inode->last_unlink_trans >= trans->transid) {
-		/*
-		 * Make sure any commits to the log are forced to be full
-		 * commits.
-		 */
-		btrfs_set_log_full_commit(trans);
-		ret = true;
-	}
-	mutex_unlock(&inode->log_mutex);
-
-	return ret;
-}
-
-/*
- * follow the dentry parent pointers up the chain and see if any
- * of the directories in it require a full commit before they can
- * be logged.  Returns zero if nothing special needs to be done or 1 if
- * a full commit is required.
- */
-static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
-					       struct btrfs_inode *inode,
-					       struct dentry *parent,
-					       struct super_block *sb)
-{
-	int ret = 0;
-	struct dentry *old_parent = NULL;
-
-	/*
-	 * for regular files, if its inode is already on disk, we don't
-	 * have to worry about the parents at all.  This is because
-	 * we can use the last_unlink_trans field to record renames
-	 * and other fun in this file.
-	 */
-	if (S_ISREG(inode->vfs_inode.i_mode) &&
-	    inode->generation < trans->transid &&
-	    inode->last_unlink_trans < trans->transid)
-		goto out;
-
-	if (!S_ISDIR(inode->vfs_inode.i_mode)) {
-		if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
-			goto out;
-		inode = BTRFS_I(d_inode(parent));
-	}
-
-	while (1) {
-		if (btrfs_must_commit_transaction(trans, inode)) {
-			ret = 1;
-			break;
-		}
-
-		if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
-			break;
-
-		if (IS_ROOT(parent)) {
-			inode = BTRFS_I(d_inode(parent));
-			if (btrfs_must_commit_transaction(trans, inode))
-				ret = 1;
-			break;
-		}
-
-		parent = dget_parent(parent);
-		dput(old_parent);
-		old_parent = parent;
-		inode = BTRFS_I(d_inode(parent));
-
-	}
-	dput(old_parent);
-out:
-	return ret;
-}
-
 /*
  * Check if we need to log an inode. This is used in contexts where while
  * logging an inode we need to log another inode (either that it exists or in
@@ -5686,9 +5608,6 @@  static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 				log_mode = LOG_INODE_ALL;
 			ret = btrfs_log_inode(trans, root, BTRFS_I(di_inode),
 					      log_mode, ctx);
-			if (!ret &&
-			    btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
-				ret = 1;
 			btrfs_add_delayed_iput(di_inode);
 			if (ret)
 				goto next_dir_inode;
@@ -5835,9 +5754,6 @@  static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
 				ctx->log_new_dentries = false;
 			ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode),
 					      LOG_INODE_ALL, ctx);
-			if (!ret &&
-			    btrfs_must_commit_transaction(trans, BTRFS_I(dir_inode)))
-				ret = 1;
 			if (!ret && ctx && ctx->log_new_dentries)
 				ret = log_new_dir_dentries(trans, root,
 						   BTRFS_I(dir_inode), ctx);
@@ -6053,12 +5969,9 @@  static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct super_block *sb;
 	int ret = 0;
 	bool log_dentries = false;
 
-	sb = inode->vfs_inode.i_sb;
-
 	if (btrfs_test_opt(fs_info, NOTREELOG)) {
 		ret = 1;
 		goto end_no_trans;
@@ -6069,10 +5982,6 @@  static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 		goto end_no_trans;
 	}
 
-	ret = check_parent_dirs_for_sync(trans, inode, parent, sb);
-	if (ret)
-		goto end_no_trans;
-
 	/*
 	 * Skip already logged inodes or inodes corresponding to tmpfiles
 	 * (since logging them is pointless, a link count of 0 means they