Message ID | 8c391b7c0a7a6a7e827644d424cd4c343f39588e.1611742865.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: more performance improvements for dbench workloads | expand |
On 1/27/21 5:35 AM, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Often an fsync needs to fallback to a transaction commit for several > reasons (to ensure consistency after a power failure, a new block group > was allocated or a temporary error such as ENOMEM or ENOSPC happened). > > In that case the log is marked as needing a full commit and any concurrent > tasks attempting to log inodes or commit the log will also fallback to the > transaction commit. When this happens they all wait for the task that first > started the transaction commit to finish the transaction commit - however > they wait until the full transaction commit happens, which is not needed, > as they only need to wait for the superblocks to be persisted and not for > unpinning all the extents pinned during the transaction's lifetime, which > even for short lived transactions can be a few thousand and take some > significant amount of time to complete - for dbench workloads I have > observed up to 4~5 milliseconds of time spent unpinning extents in the > worst cases, and the number of pinned extents was between 2 to 3 thousand. > > So allow fsync tasks to skip waiting for the unpinning of extents when > they call btrfs_commit_transaction() and they were not the task that > started the transaction commit (that one has to do it, the alternative > would be to offload the transaction commit to another task so that it > could avoid waiting for the extent unpinning or offload the extent > unpinning to another task). > > This patch is part of a patchset comprised of the following patches: > > btrfs: remove unnecessary directory inode item update when deleting dir entry > btrfs: stop setting nbytes when filling inode item for logging > btrfs: avoid logging new ancestor inodes when logging new inode > btrfs: skip logging directories already logged when logging all parents > btrfs: skip logging inodes already logged when logging new entries > btrfs: remove unnecessary check_parent_dirs_for_sync() > btrfs: make concurrent fsyncs wait less when waiting for a transaction commit > > After applying the entire patchset, dbench shows improvements in respect > to throughput and latency. The script used to measure it is the following: > > $ cat dbench-test.sh > #!/bin/bash > > DEV=/dev/sdk > MNT=/mnt/sdk > MOUNT_OPTIONS="-o ssd" > MKFS_OPTIONS="-m single -d single" > > echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > > umount $DEV &> /dev/null > mkfs.btrfs -f $MKFS_OPTIONS $DEV > mount $MOUNT_OPTIONS $DEV $MNT > > dbench -D $MNT -t 300 64 > > umount $MNT > > The test was run on a physical machine with 12 cores (Intel corei7), 64G > of ram, using a NVMe device and a non-debug kernel configuration (Debian's > default configuration). > > Before applying patchset, 32 clients: > > Operation Count AvgLat MaxLat > ---------------------------------------- > NTCreateX 9627107 0.153 61.938 > Close 7072076 0.001 3.175 > Rename 407633 1.222 44.439 > Unlink 1943895 0.658 44.440 > Deltree 256 17.339 110.891 > Mkdir 128 0.003 0.009 > Qpathinfo 8725406 0.064 17.850 > Qfileinfo 1529516 0.001 2.188 > Qfsinfo 1599884 0.002 1.457 > Sfileinfo 784200 0.005 3.562 > Find 3373513 0.411 30.312 > WriteX 4802132 0.053 29.054 > ReadX 15089959 0.002 5.801 > LockX 31344 0.002 0.425 > UnlockX 31344 0.001 0.173 > Flush 674724 5.952 341.830 > > Throughput 1008.02 MB/sec 32 clients 32 procs max_latency=341.833 ms > > After applying patchset, 32 clients: > > After patchset, with 32 clients: > > Operation Count AvgLat MaxLat > ---------------------------------------- > NTCreateX 9931568 0.111 25.597 > Close 7295730 0.001 2.171 > Rename 420549 0.982 49.714 > Unlink 2005366 0.497 39.015 > Deltree 256 11.149 89.242 > Mkdir 128 0.002 0.014 > Qpathinfo 9001863 0.049 20.761 > Qfileinfo 1577730 0.001 2.546 > Qfsinfo 1650508 0.002 3.531 > Sfileinfo 809031 0.005 5.846 > Find 3480259 0.309 23.977 > WriteX 4952505 0.043 41.283 > ReadX 15568127 0.002 5.476 > LockX 32338 0.002 0.978 > UnlockX 32338 0.001 2.032 > Flush 696017 7.485 228.835 > > Throughput 1049.91 MB/sec 32 clients 32 procs max_latency=228.847 ms > > --> +4.1% throughput, -39.6% max latency > > Before applying patchset, 64 clients: > > Operation Count AvgLat MaxLat > ---------------------------------------- > NTCreateX 8956748 0.342 108.312 > Close 6579660 0.001 3.823 > Rename 379209 2.396 81.897 > Unlink 1808625 1.108 131.148 > Deltree 256 25.632 172.176 > Mkdir 128 0.003 0.018 > Qpathinfo 8117615 0.131 55.916 > Qfileinfo 1423495 0.001 2.635 > Qfsinfo 1488496 0.002 5.412 > Sfileinfo 729472 0.007 8.643 > Find 3138598 0.855 78.321 > WriteX 4470783 0.102 79.442 > ReadX 14038139 0.002 7.578 > LockX 29158 0.002 0.844 > UnlockX 29158 0.001 0.567 > Flush 627746 14.168 506.151 > > Throughput 924.738 MB/sec 64 clients 64 procs max_latency=506.154 ms > > After applying patchset, 64 clients: > > Operation Count AvgLat MaxLat > ---------------------------------------- > NTCreateX 9069003 0.303 43.193 > Close 6662328 0.001 3.888 > Rename 383976 2.194 46.418 > Unlink 1831080 1.022 43.873 > Deltree 256 24.037 155.763 > Mkdir 128 0.002 0.005 > Qpathinfo 8219173 0.137 30.233 > Qfileinfo 1441203 0.001 3.204 > Qfsinfo 1507092 0.002 4.055 > Sfileinfo 738775 0.006 5.431 > Find 3177874 0.936 38.170 > WriteX 4526152 0.084 39.518 > ReadX 14213562 0.002 24.760 > LockX 29522 0.002 1.221 > UnlockX 29522 0.001 0.694 > Flush 635652 14.358 422.039 > > Throughput 990.13 MB/sec 64 clients 64 procs max_latency=422.043 ms > > --> +6.8% throughput, -18.1% max latency > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Funnily enough I've got a patch to async off unpinning as it drastically affects our WhatsApp workload. Once that lands maybe we can undo this bit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Thanks, Josef
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index d81ae1f518f2..be5350f5bedf 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2238,6 +2238,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) ret = PTR_ERR(trans); goto out_release_extents; } + trans->in_fsync = true; ret = btrfs_log_dentry_safe(trans, dentry, &ctx); btrfs_release_log_ctx_extents(&ctx); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 3bcb5444536e..ff8efa6f8986 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -107,6 +107,11 @@ static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = { __TRANS_JOIN | __TRANS_JOIN_NOLOCK | __TRANS_JOIN_NOSTART), + [TRANS_STATE_SUPER_COMMITTED] = (__TRANS_START | + __TRANS_ATTACH | + __TRANS_JOIN | + __TRANS_JOIN_NOLOCK | + __TRANS_JOIN_NOSTART), [TRANS_STATE_COMPLETED] = (__TRANS_START | __TRANS_ATTACH | __TRANS_JOIN | @@ -826,10 +831,11 @@ btrfs_attach_transaction_barrier(struct btrfs_root *root) return trans; } -/* wait for a transaction commit to be fully complete */ -static noinline void wait_for_commit(struct btrfs_transaction *commit) +/* Wait for a transaction commit to reach at least the given state. */ +static noinline void wait_for_commit(struct btrfs_transaction *commit, + const enum btrfs_trans_state min_state) { - wait_event(commit->commit_wait, commit->state == TRANS_STATE_COMPLETED); + wait_event(commit->commit_wait, commit->state >= min_state); } int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid) @@ -884,7 +890,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid) goto out; /* nothing committing|committed */ } - wait_for_commit(cur_trans); + wait_for_commit(cur_trans, TRANS_STATE_COMPLETED); btrfs_put_transaction(cur_trans); out: return ret; @@ -2102,11 +2108,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) spin_lock(&fs_info->trans_lock); if (cur_trans->state >= TRANS_STATE_COMMIT_START) { + enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED; + spin_unlock(&fs_info->trans_lock); refcount_inc(&cur_trans->use_count); - ret = btrfs_end_transaction(trans); - wait_for_commit(cur_trans); + if (trans->in_fsync) + want_state = TRANS_STATE_SUPER_COMMITTED; + ret = btrfs_end_transaction(trans); + wait_for_commit(cur_trans, want_state); if (TRANS_ABORTED(cur_trans)) ret = cur_trans->aborted; @@ -2120,13 +2130,19 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) wake_up(&fs_info->transaction_blocked_wait); if (cur_trans->list.prev != &fs_info->trans_list) { + enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED; + + if (trans->in_fsync) + want_state = TRANS_STATE_SUPER_COMMITTED; + prev_trans = list_entry(cur_trans->list.prev, struct btrfs_transaction, list); - if (prev_trans->state != TRANS_STATE_COMPLETED) { + if (prev_trans->state < want_state) { refcount_inc(&prev_trans->use_count); spin_unlock(&fs_info->trans_lock); - wait_for_commit(prev_trans); + wait_for_commit(prev_trans, want_state); + ret = READ_ONCE(prev_trans->aborted); btrfs_put_transaction(prev_trans); @@ -2345,6 +2361,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) if (ret) goto scrub_continue; + /* + * We needn't acquire the lock here because there is no other task + * which can change it. + */ + cur_trans->state = TRANS_STATE_SUPER_COMMITTED; + wake_up(&cur_trans->commit_wait); + btrfs_finish_extent_commit(trans); if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &cur_trans->flags)) diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 31ca81bad822..935bd6958a8a 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -16,6 +16,7 @@ enum btrfs_trans_state { TRANS_STATE_COMMIT_START, TRANS_STATE_COMMIT_DOING, TRANS_STATE_UNBLOCKED, + TRANS_STATE_SUPER_COMMITTED, TRANS_STATE_COMPLETED, TRANS_STATE_MAX, }; @@ -133,6 +134,7 @@ struct btrfs_trans_handle { bool can_flush_pending_bgs; bool reloc_reserved; bool dirty; + bool in_fsync; struct btrfs_root *root; struct btrfs_fs_info *fs_info; struct list_head new_bgs;