[1/4] btrfs: only commit the delayed inode when doing a full fsync
diff mbox series

Message ID 20200702113159.153135-1-fdmanana@kernel.org
State New
Headers show
Series
  • [1/4] btrfs: only commit the delayed inode when doing a full fsync
Related show

Commit Message

Filipe Manana July 2, 2020, 11:31 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Commit 2c2c452b0cafdc ("Btrfs: fix fsync when extend references are added
to an inode") forced a commit of the delayed inode when logging an inode
in order to ensure we would end up logging the inode item during a full
fsync. By committing the delayed inode, we updated the inode item in the
fs/subvolume tree and then later when copying items from leafs modified in
the current transaction into the log tree (with copy_inode_items_to_log())
we ended up copying the inode item from the fs/subvolume tree into the log
tree. Logging an up to date version of the inode item is required to make
sure at log replay time we get the link count fixup triggered among other
things (replay xattr deletes, etc). The test case generic/040 from fstests
exercises the bug which that commit fixed.

However for a fast fsync we don't need to commit the delayed inode because
we always log an up to date version of the inode item based on the struct
btrfs_inode we have in-memory. We started doing this for fast fsyncs since
commit e4545de5b035c7 ("Btrfs: fix fsync data loss after append write").

So just stop committing the delayed inode if we are doing a fast fsync,
we are only wasting time and adding contention on fs/subvolume tree.

This patch is part of a series that has the following patches:

1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree

After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:

  mkfs.btrfs -f /dev/sdk
  mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
  dbench -D /mnt/sdk -t 300 8
  umount /mnt/dsk

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Josef Bacik July 2, 2020, 1 p.m. UTC | #1
On 7/2/20 7:31 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Commit 2c2c452b0cafdc ("Btrfs: fix fsync when extend references are added
> to an inode") forced a commit of the delayed inode when logging an inode
> in order to ensure we would end up logging the inode item during a full
> fsync. By committing the delayed inode, we updated the inode item in the
> fs/subvolume tree and then later when copying items from leafs modified in
> the current transaction into the log tree (with copy_inode_items_to_log())
> we ended up copying the inode item from the fs/subvolume tree into the log
> tree. Logging an up to date version of the inode item is required to make
> sure at log replay time we get the link count fixup triggered among other
> things (replay xattr deletes, etc). The test case generic/040 from fstests
> exercises the bug which that commit fixed.
> 
> However for a fast fsync we don't need to commit the delayed inode because
> we always log an up to date version of the inode item based on the struct
> btrfs_inode we have in-memory. We started doing this for fast fsyncs since
> commit e4545de5b035c7 ("Btrfs: fix fsync data loss after append write").
> 
> So just stop committing the delayed inode if we are doing a fast fsync,
> we are only wasting time and adding contention on fs/subvolume tree.
> 
> This patch is part of a series that has the following patches:
> 
> 1/4 btrfs: only commit the delayed inode when doing a full fsync
> 2/4 btrfs: only commit delayed items at fsync if we are logging a directory
> 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
> 4/4 btrfs: remove no longer needed use of log_writers for the log root tree
> 
> After the entire patchset applied I saw about 12% decrease on max latency
> reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
> ram, using kvm and using a raw NVMe device directly (no intermediary fs on
> the host). The test was invoked like the following:
> 
>    mkfs.btrfs -f /dev/sdk
>    mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
>    dbench -D /mnt/sdk -t 300 8
>    umount /mnt/dsk
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
David Sterba July 3, 2020, 12:08 p.m. UTC | #2
On Thu, Jul 02, 2020 at 12:31:59PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Commit 2c2c452b0cafdc ("Btrfs: fix fsync when extend references are added
> to an inode") forced a commit of the delayed inode when logging an inode
> in order to ensure we would end up logging the inode item during a full
> fsync. By committing the delayed inode, we updated the inode item in the
> fs/subvolume tree and then later when copying items from leafs modified in
> the current transaction into the log tree (with copy_inode_items_to_log())
> we ended up copying the inode item from the fs/subvolume tree into the log
> tree. Logging an up to date version of the inode item is required to make
> sure at log replay time we get the link count fixup triggered among other
> things (replay xattr deletes, etc). The test case generic/040 from fstests
> exercises the bug which that commit fixed.
> 
> However for a fast fsync we don't need to commit the delayed inode because
> we always log an up to date version of the inode item based on the struct
> btrfs_inode we have in-memory. We started doing this for fast fsyncs since
> commit e4545de5b035c7 ("Btrfs: fix fsync data loss after append write").
> 
> So just stop committing the delayed inode if we are doing a fast fsync,
> we are only wasting time and adding contention on fs/subvolume tree.
> 
> This patch is part of a series that has the following patches:
> 
> 1/4 btrfs: only commit the delayed inode when doing a full fsync
> 2/4 btrfs: only commit delayed items at fsync if we are logging a directory
> 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
> 4/4 btrfs: remove no longer needed use of log_writers for the log root tree
> 
> After the entire patchset applied I saw about 12% decrease on max latency
> reported by dbench.

That's impressive. Getting reliable perf improvements in the low
percents is hard and 10+ is beyond expectations.

As the patches are short I'd like to tag them for stable. The closest
one that applies to all is 5.4, that I determined from the commit
references in the changelogs. However I'd appreciate if you could take a
look if it's worth to tag the patches for older stable trees where it
applies (since 4.4). I don't have full overview of all the logging or
fsync updates so might miss some dependency. Thanks.
Filipe Manana July 3, 2020, 2:36 p.m. UTC | #3
On Fri, Jul 3, 2020 at 1:08 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Thu, Jul 02, 2020 at 12:31:59PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Commit 2c2c452b0cafdc ("Btrfs: fix fsync when extend references are added
> > to an inode") forced a commit of the delayed inode when logging an inode
> > in order to ensure we would end up logging the inode item during a full
> > fsync. By committing the delayed inode, we updated the inode item in the
> > fs/subvolume tree and then later when copying items from leafs modified in
> > the current transaction into the log tree (with copy_inode_items_to_log())
> > we ended up copying the inode item from the fs/subvolume tree into the log
> > tree. Logging an up to date version of the inode item is required to make
> > sure at log replay time we get the link count fixup triggered among other
> > things (replay xattr deletes, etc). The test case generic/040 from fstests
> > exercises the bug which that commit fixed.
> >
> > However for a fast fsync we don't need to commit the delayed inode because
> > we always log an up to date version of the inode item based on the struct
> > btrfs_inode we have in-memory. We started doing this for fast fsyncs since
> > commit e4545de5b035c7 ("Btrfs: fix fsync data loss after append write").
> >
> > So just stop committing the delayed inode if we are doing a fast fsync,
> > we are only wasting time and adding contention on fs/subvolume tree.
> >
> > This patch is part of a series that has the following patches:
> >
> > 1/4 btrfs: only commit the delayed inode when doing a full fsync
> > 2/4 btrfs: only commit delayed items at fsync if we are logging a directory
> > 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
> > 4/4 btrfs: remove no longer needed use of log_writers for the log root tree
> >
> > After the entire patchset applied I saw about 12% decrease on max latency
> > reported by dbench.
>
> That's impressive. Getting reliable perf improvements in the low
> percents is hard and 10+ is beyond expectations.

Well, I don't find it that impressive. Not that it isn't good, just
not that huge.
While the reported aggregated max latency decreases by around 12%, the
aggregated throughput only increased by around 1.2 to 1.5%, probably
just noise.

>
> As the patches are short I'd like to tag them for stable. The closest
> one that applies to all is 5.4, that I determined from the commit
> references in the changelogs. However I'd appreciate if you could take a
> look if it's worth to tag the patches for older stable trees where it
> applies (since 4.4). I don't have full overview of all the logging or
> fsync updates so might miss some dependency. Thanks.

Normally I don't count changes like these for stable, unless they fix
a significant performance regression.
But I don't oppose adding them to stable either, to whichever releases
they apply cleanly.

thanks

Patch
diff mbox series

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index df6d4e3e40b1..44bbf8919883 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5130,7 +5130,7 @@  static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_key max_key;
 	struct btrfs_root *log = root->log_root;
 	int err = 0;
-	int ret;
+	int ret = 0;
 	bool fast_search = false;
 	u64 ino = btrfs_ino(inode);
 	struct extent_map_tree *em_tree = &inode->extent_tree;
@@ -5167,14 +5167,16 @@  static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 
 	/*
 	 * Only run delayed items if we are a dir or a new file.
-	 * Otherwise commit the delayed inode only, which is needed in
-	 * order for the log replay code to mark inodes for link count
-	 * fixup (create temporary BTRFS_TREE_LOG_FIXUP_OBJECTID items).
+	 * Otherwise commit the delayed inode only if the full sync flag is set,
+	 * as we want to make sure an up to date version is in the subvolume
+	 * tree so copy_inode_items_to_log() / copy_items() can find it and copy
+	 * it to the log tree. For a non full sync, we always log the inode item
+	 * based on the in-memory struct btrfs_inode which is always up to date.
 	 */
 	if (S_ISDIR(inode->vfs_inode.i_mode) ||
 	    inode->generation > fs_info->last_trans_committed)
 		ret = btrfs_commit_inode_delayed_items(trans, inode);
-	else
+	else if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags))
 		ret = btrfs_commit_inode_delayed_inode(inode);
 
 	if (ret) {