Message ID | 20200702113159.153135-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/4] btrfs: only commit the delayed inode when doing a full fsync | expand |
On 7/2/20 7:31 AM, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Commit 2c2c452b0cafdc ("Btrfs: fix fsync when extend references are added > to an inode") forced a commit of the delayed inode when logging an inode > in order to ensure we would end up logging the inode item during a full > fsync. By committing the delayed inode, we updated the inode item in the > fs/subvolume tree and then later when copying items from leafs modified in > the current transaction into the log tree (with copy_inode_items_to_log()) > we ended up copying the inode item from the fs/subvolume tree into the log > tree. Logging an up to date version of the inode item is required to make > sure at log replay time we get the link count fixup triggered among other > things (replay xattr deletes, etc). The test case generic/040 from fstests > exercises the bug which that commit fixed. > > However for a fast fsync we don't need to commit the delayed inode because > we always log an up to date version of the inode item based on the struct > btrfs_inode we have in-memory. We started doing this for fast fsyncs since > commit e4545de5b035c7 ("Btrfs: fix fsync data loss after append write"). > > So just stop committing the delayed inode if we are doing a fast fsync, > we are only wasting time and adding contention on fs/subvolume tree. > > This patch is part of a series that has the following patches: > > 1/4 btrfs: only commit the delayed inode when doing a full fsync > 2/4 btrfs: only commit delayed items at fsync if we are logging a directory > 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log > 4/4 btrfs: remove no longer needed use of log_writers for the log root tree > > After the entire patchset applied I saw about 12% decrease on max latency > reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of > ram, using kvm and using a raw NVMe device directly (no intermediary fs on > the host). The test was invoked like the following: > > mkfs.btrfs -f /dev/sdk > mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk > dbench -D /mnt/sdk -t 300 8 > umount /mnt/dsk > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
On Thu, Jul 02, 2020 at 12:31:59PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Commit 2c2c452b0cafdc ("Btrfs: fix fsync when extend references are added > to an inode") forced a commit of the delayed inode when logging an inode > in order to ensure we would end up logging the inode item during a full > fsync. By committing the delayed inode, we updated the inode item in the > fs/subvolume tree and then later when copying items from leafs modified in > the current transaction into the log tree (with copy_inode_items_to_log()) > we ended up copying the inode item from the fs/subvolume tree into the log > tree. Logging an up to date version of the inode item is required to make > sure at log replay time we get the link count fixup triggered among other > things (replay xattr deletes, etc). The test case generic/040 from fstests > exercises the bug which that commit fixed. > > However for a fast fsync we don't need to commit the delayed inode because > we always log an up to date version of the inode item based on the struct > btrfs_inode we have in-memory. We started doing this for fast fsyncs since > commit e4545de5b035c7 ("Btrfs: fix fsync data loss after append write"). > > So just stop committing the delayed inode if we are doing a fast fsync, > we are only wasting time and adding contention on fs/subvolume tree. > > This patch is part of a series that has the following patches: > > 1/4 btrfs: only commit the delayed inode when doing a full fsync > 2/4 btrfs: only commit delayed items at fsync if we are logging a directory > 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log > 4/4 btrfs: remove no longer needed use of log_writers for the log root tree > > After the entire patchset applied I saw about 12% decrease on max latency > reported by dbench. That's impressive. Getting reliable perf improvements in the low percents is hard and 10+ is beyond expectations. As the patches are short I'd like to tag them for stable. The closest one that applies to all is 5.4, that I determined from the commit references in the changelogs. However I'd appreciate if you could take a look if it's worth to tag the patches for older stable trees where it applies (since 4.4). I don't have full overview of all the logging or fsync updates so might miss some dependency. Thanks.
On Fri, Jul 3, 2020 at 1:08 PM David Sterba <dsterba@suse.cz> wrote: > > On Thu, Jul 02, 2020 at 12:31:59PM +0100, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > Commit 2c2c452b0cafdc ("Btrfs: fix fsync when extend references are added > > to an inode") forced a commit of the delayed inode when logging an inode > > in order to ensure we would end up logging the inode item during a full > > fsync. By committing the delayed inode, we updated the inode item in the > > fs/subvolume tree and then later when copying items from leafs modified in > > the current transaction into the log tree (with copy_inode_items_to_log()) > > we ended up copying the inode item from the fs/subvolume tree into the log > > tree. Logging an up to date version of the inode item is required to make > > sure at log replay time we get the link count fixup triggered among other > > things (replay xattr deletes, etc). The test case generic/040 from fstests > > exercises the bug which that commit fixed. > > > > However for a fast fsync we don't need to commit the delayed inode because > > we always log an up to date version of the inode item based on the struct > > btrfs_inode we have in-memory. We started doing this for fast fsyncs since > > commit e4545de5b035c7 ("Btrfs: fix fsync data loss after append write"). > > > > So just stop committing the delayed inode if we are doing a fast fsync, > > we are only wasting time and adding contention on fs/subvolume tree. > > > > This patch is part of a series that has the following patches: > > > > 1/4 btrfs: only commit the delayed inode when doing a full fsync > > 2/4 btrfs: only commit delayed items at fsync if we are logging a directory > > 3/4 btrfs: stop incremening log_batch for the log root tree when syncing log > > 4/4 btrfs: remove no longer needed use of log_writers for the log root tree > > > > After the entire patchset applied I saw about 12% decrease on max latency > > reported by dbench. > > That's impressive. Getting reliable perf improvements in the low > percents is hard and 10+ is beyond expectations. Well, I don't find it that impressive. Not that it isn't good, just not that huge. While the reported aggregated max latency decreases by around 12%, the aggregated throughput only increased by around 1.2 to 1.5%, probably just noise. > > As the patches are short I'd like to tag them for stable. The closest > one that applies to all is 5.4, that I determined from the commit > references in the changelogs. However I'd appreciate if you could take a > look if it's worth to tag the patches for older stable trees where it > applies (since 4.4). I don't have full overview of all the logging or > fsync updates so might miss some dependency. Thanks. Normally I don't count changes like these for stable, unless they fix a significant performance regression. But I don't oppose adding them to stable either, to whichever releases they apply cleanly. thanks
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index df6d4e3e40b1..44bbf8919883 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5130,7 +5130,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, struct btrfs_key max_key; struct btrfs_root *log = root->log_root; int err = 0; - int ret; + int ret = 0; bool fast_search = false; u64 ino = btrfs_ino(inode); struct extent_map_tree *em_tree = &inode->extent_tree; @@ -5167,14 +5167,16 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, /* * Only run delayed items if we are a dir or a new file. - * Otherwise commit the delayed inode only, which is needed in - * order for the log replay code to mark inodes for link count - * fixup (create temporary BTRFS_TREE_LOG_FIXUP_OBJECTID items). + * Otherwise commit the delayed inode only if the full sync flag is set, + * as we want to make sure an up to date version is in the subvolume + * tree so copy_inode_items_to_log() / copy_items() can find it and copy + * it to the log tree. For a non full sync, we always log the inode item + * based on the in-memory struct btrfs_inode which is always up to date. */ if (S_ISDIR(inode->vfs_inode.i_mode) || inode->generation > fs_info->last_trans_committed) ret = btrfs_commit_inode_delayed_items(trans, inode); - else + else if (test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags)) ret = btrfs_commit_inode_delayed_inode(inode); if (ret) {