diff mbox series

[7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Message ID 8c391b7c0a7a6a7e827644d424cd4c343f39588e.1611742865.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: more performance improvements for dbench workloads | expand

Commit Message

Filipe Manana Jan. 27, 2021, 10:35 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Often an fsync needs to fallback to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).

In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fallback to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish the transaction commit - however
they wait until the full transaction commit happens, which is not needed,
as they only need to wait for the superblocks to be persisted and not for
unpinning all the extents pinned during the transaction's lifetime, which
even for short lived transactions can be a few thousand and take some
significant amount of time to complete - for dbench workloads I have
observed up to 4~5 milliseconds of time spent unpinning extents in the
worst cases, and the number of pinned extents was between 2 to 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements in respect
to throughput and latency. The script used to measure it is the following:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).

Before applying patchset, 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9627107     0.153    61.938
 Close        7072076     0.001     3.175
 Rename        407633     1.222    44.439
 Unlink       1943895     0.658    44.440
 Deltree          256    17.339   110.891
 Mkdir            128     0.003     0.009
 Qpathinfo    8725406     0.064    17.850
 Qfileinfo    1529516     0.001     2.188
 Qfsinfo      1599884     0.002     1.457
 Sfileinfo     784200     0.005     3.562
 Find         3373513     0.411    30.312
 WriteX       4802132     0.053    29.054
 ReadX        15089959     0.002     5.801
 LockX          31344     0.002     0.425
 UnlockX        31344     0.001     0.173
 Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

After patchset, with 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9931568     0.111    25.597
 Close        7295730     0.001     2.171
 Rename        420549     0.982    49.714
 Unlink       2005366     0.497    39.015
 Deltree          256    11.149    89.242
 Mkdir            128     0.002     0.014
 Qpathinfo    9001863     0.049    20.761
 Qfileinfo    1577730     0.001     2.546
 Qfsinfo      1650508     0.002     3.531
 Sfileinfo     809031     0.005     5.846
 Find         3480259     0.309    23.977
 WriteX       4952505     0.043    41.283
 ReadX        15568127     0.002     5.476
 LockX          32338     0.002     0.978
 UnlockX        32338     0.001     2.032
 Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

 --> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    8956748     0.342   108.312
 Close        6579660     0.001     3.823
 Rename        379209     2.396    81.897
 Unlink       1808625     1.108   131.148
 Deltree          256    25.632   172.176
 Mkdir            128     0.003     0.018
 Qpathinfo    8117615     0.131    55.916
 Qfileinfo    1423495     0.001     2.635
 Qfsinfo      1488496     0.002     5.412
 Sfileinfo     729472     0.007     8.643
 Find         3138598     0.855    78.321
 WriteX       4470783     0.102    79.442
 ReadX        14038139     0.002     7.578
 LockX          29158     0.002     0.844
 UnlockX        29158     0.001     0.567
 Flush         627746    14.168   506.151

Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms

After applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9069003     0.303    43.193
 Close        6662328     0.001     3.888
 Rename        383976     2.194    46.418
 Unlink       1831080     1.022    43.873
 Deltree          256    24.037   155.763
 Mkdir            128     0.002     0.005
 Qpathinfo    8219173     0.137    30.233
 Qfileinfo    1441203     0.001     3.204
 Qfsinfo      1507092     0.002     4.055
 Sfileinfo     738775     0.006     5.431
 Find         3177874     0.936    38.170
 WriteX       4526152     0.084    39.518
 ReadX        14213562     0.002    24.760
 LockX          29522     0.002     1.221
 UnlockX        29522     0.001     0.694
 Flush         635652    14.358   422.039

Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms

 --> +6.8% throughput, -18.1% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c        |  1 +
 fs/btrfs/transaction.c | 39 +++++++++++++++++++++++++++++++--------
 fs/btrfs/transaction.h |  2 ++
 3 files changed, 34 insertions(+), 8 deletions(-)

Comments

Josef Bacik Jan. 27, 2021, 3:26 p.m. UTC | #1
On 1/27/21 5:35 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Often an fsync needs to fallback to a transaction commit for several
> reasons (to ensure consistency after a power failure, a new block group
> was allocated or a temporary error such as ENOMEM or ENOSPC happened).
> 
> In that case the log is marked as needing a full commit and any concurrent
> tasks attempting to log inodes or commit the log will also fallback to the
> transaction commit. When this happens they all wait for the task that first
> started the transaction commit to finish the transaction commit - however
> they wait until the full transaction commit happens, which is not needed,
> as they only need to wait for the superblocks to be persisted and not for
> unpinning all the extents pinned during the transaction's lifetime, which
> even for short lived transactions can be a few thousand and take some
> significant amount of time to complete - for dbench workloads I have
> observed up to 4~5 milliseconds of time spent unpinning extents in the
> worst cases, and the number of pinned extents was between 2 to 3 thousand.
> 
> So allow fsync tasks to skip waiting for the unpinning of extents when
> they call btrfs_commit_transaction() and they were not the task that
> started the transaction commit (that one has to do it, the alternative
> would be to offload the transaction commit to another task so that it
> could avoid waiting for the extent unpinning or offload the extent
> unpinning to another task).
> 
> This patch is part of a patchset comprised of the following patches:
> 
>    btrfs: remove unnecessary directory inode item update when deleting dir entry
>    btrfs: stop setting nbytes when filling inode item for logging
>    btrfs: avoid logging new ancestor inodes when logging new inode
>    btrfs: skip logging directories already logged when logging all parents
>    btrfs: skip logging inodes already logged when logging new entries
>    btrfs: remove unnecessary check_parent_dirs_for_sync()
>    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> 
> After applying the entire patchset, dbench shows improvements in respect
> to throughput and latency. The script used to measure it is the following:
> 
>    $ cat dbench-test.sh
>    #!/bin/bash
> 
>    DEV=/dev/sdk
>    MNT=/mnt/sdk
>    MOUNT_OPTIONS="-o ssd"
>    MKFS_OPTIONS="-m single -d single"
> 
>    echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> 
>    umount $DEV &> /dev/null
>    mkfs.btrfs -f $MKFS_OPTIONS $DEV
>    mount $MOUNT_OPTIONS $DEV $MNT
> 
>    dbench -D $MNT -t 300 64
> 
>    umount $MNT
> 
> The test was run on a physical machine with 12 cores (Intel corei7), 64G
> of ram, using a NVMe device and a non-debug kernel configuration (Debian's
> default configuration).
> 
> Before applying patchset, 32 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    9627107     0.153    61.938
>   Close        7072076     0.001     3.175
>   Rename        407633     1.222    44.439
>   Unlink       1943895     0.658    44.440
>   Deltree          256    17.339   110.891
>   Mkdir            128     0.003     0.009
>   Qpathinfo    8725406     0.064    17.850
>   Qfileinfo    1529516     0.001     2.188
>   Qfsinfo      1599884     0.002     1.457
>   Sfileinfo     784200     0.005     3.562
>   Find         3373513     0.411    30.312
>   WriteX       4802132     0.053    29.054
>   ReadX        15089959     0.002     5.801
>   LockX          31344     0.002     0.425
>   UnlockX        31344     0.001     0.173
>   Flush         674724     5.952   341.830
> 
> Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms
> 
> After applying patchset, 32 clients:
> 
> After patchset, with 32 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    9931568     0.111    25.597
>   Close        7295730     0.001     2.171
>   Rename        420549     0.982    49.714
>   Unlink       2005366     0.497    39.015
>   Deltree          256    11.149    89.242
>   Mkdir            128     0.002     0.014
>   Qpathinfo    9001863     0.049    20.761
>   Qfileinfo    1577730     0.001     2.546
>   Qfsinfo      1650508     0.002     3.531
>   Sfileinfo     809031     0.005     5.846
>   Find         3480259     0.309    23.977
>   WriteX       4952505     0.043    41.283
>   ReadX        15568127     0.002     5.476
>   LockX          32338     0.002     0.978
>   UnlockX        32338     0.001     2.032
>   Flush         696017     7.485   228.835
> 
> Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms
> 
>   --> +4.1% throughput, -39.6% max latency
> 
> Before applying patchset, 64 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    8956748     0.342   108.312
>   Close        6579660     0.001     3.823
>   Rename        379209     2.396    81.897
>   Unlink       1808625     1.108   131.148
>   Deltree          256    25.632   172.176
>   Mkdir            128     0.003     0.018
>   Qpathinfo    8117615     0.131    55.916
>   Qfileinfo    1423495     0.001     2.635
>   Qfsinfo      1488496     0.002     5.412
>   Sfileinfo     729472     0.007     8.643
>   Find         3138598     0.855    78.321
>   WriteX       4470783     0.102    79.442
>   ReadX        14038139     0.002     7.578
>   LockX          29158     0.002     0.844
>   UnlockX        29158     0.001     0.567
>   Flush         627746    14.168   506.151
> 
> Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms
> 
> After applying patchset, 64 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    9069003     0.303    43.193
>   Close        6662328     0.001     3.888
>   Rename        383976     2.194    46.418
>   Unlink       1831080     1.022    43.873
>   Deltree          256    24.037   155.763
>   Mkdir            128     0.002     0.005
>   Qpathinfo    8219173     0.137    30.233
>   Qfileinfo    1441203     0.001     3.204
>   Qfsinfo      1507092     0.002     4.055
>   Sfileinfo     738775     0.006     5.431
>   Find         3177874     0.936    38.170
>   WriteX       4526152     0.084    39.518
>   ReadX        14213562     0.002    24.760
>   LockX          29522     0.002     1.221
>   UnlockX        29522     0.001     0.694
>   Flush         635652    14.358   422.039
> 
> Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms
> 
>   --> +6.8% throughput, -18.1% max latency
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Funnily enough I've got a patch to async off unpinning as it drastically affects 
our WhatsApp workload.  Once that lands maybe we can undo this bit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
diff mbox series

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d81ae1f518f2..be5350f5bedf 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2238,6 +2238,7 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		ret = PTR_ERR(trans);
 		goto out_release_extents;
 	}
+	trans->in_fsync = true;
 
 	ret = btrfs_log_dentry_safe(trans, dentry, &ctx);
 	btrfs_release_log_ctx_extents(&ctx);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3bcb5444536e..ff8efa6f8986 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -107,6 +107,11 @@  static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
 					   __TRANS_JOIN |
 					   __TRANS_JOIN_NOLOCK |
 					   __TRANS_JOIN_NOSTART),
+	[TRANS_STATE_SUPER_COMMITTED]	= (__TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN |
+					   __TRANS_JOIN_NOLOCK |
+					   __TRANS_JOIN_NOSTART),
 	[TRANS_STATE_COMPLETED]		= (__TRANS_START |
 					   __TRANS_ATTACH |
 					   __TRANS_JOIN |
@@ -826,10 +831,11 @@  btrfs_attach_transaction_barrier(struct btrfs_root *root)
 	return trans;
 }
 
-/* wait for a transaction commit to be fully complete */
-static noinline void wait_for_commit(struct btrfs_transaction *commit)
+/* Wait for a transaction commit to reach at least the given state. */
+static noinline void wait_for_commit(struct btrfs_transaction *commit,
+				     const enum btrfs_trans_state min_state)
 {
-	wait_event(commit->commit_wait, commit->state == TRANS_STATE_COMPLETED);
+	wait_event(commit->commit_wait, commit->state >= min_state);
 }
 
 int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
@@ -884,7 +890,7 @@  int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
 			goto out;  /* nothing committing|committed */
 	}
 
-	wait_for_commit(cur_trans);
+	wait_for_commit(cur_trans, TRANS_STATE_COMPLETED);
 	btrfs_put_transaction(cur_trans);
 out:
 	return ret;
@@ -2102,11 +2108,15 @@  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	spin_lock(&fs_info->trans_lock);
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
+		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
+
 		spin_unlock(&fs_info->trans_lock);
 		refcount_inc(&cur_trans->use_count);
-		ret = btrfs_end_transaction(trans);
 
-		wait_for_commit(cur_trans);
+		if (trans->in_fsync)
+			want_state = TRANS_STATE_SUPER_COMMITTED;
+		ret = btrfs_end_transaction(trans);
+		wait_for_commit(cur_trans, want_state);
 
 		if (TRANS_ABORTED(cur_trans))
 			ret = cur_trans->aborted;
@@ -2120,13 +2130,19 @@  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	wake_up(&fs_info->transaction_blocked_wait);
 
 	if (cur_trans->list.prev != &fs_info->trans_list) {
+		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
+
+		if (trans->in_fsync)
+			want_state = TRANS_STATE_SUPER_COMMITTED;
+
 		prev_trans = list_entry(cur_trans->list.prev,
 					struct btrfs_transaction, list);
-		if (prev_trans->state != TRANS_STATE_COMPLETED) {
+		if (prev_trans->state < want_state) {
 			refcount_inc(&prev_trans->use_count);
 			spin_unlock(&fs_info->trans_lock);
 
-			wait_for_commit(prev_trans);
+			wait_for_commit(prev_trans, want_state);
+
 			ret = READ_ONCE(prev_trans->aborted);
 
 			btrfs_put_transaction(prev_trans);
@@ -2345,6 +2361,13 @@  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (ret)
 		goto scrub_continue;
 
+	/*
+	 * We needn't acquire the lock here because there is no other task
+	 * which can change it.
+	 */
+	cur_trans->state = TRANS_STATE_SUPER_COMMITTED;
+	wake_up(&cur_trans->commit_wait);
+
 	btrfs_finish_extent_commit(trans);
 
 	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &cur_trans->flags))
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 31ca81bad822..935bd6958a8a 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -16,6 +16,7 @@  enum btrfs_trans_state {
 	TRANS_STATE_COMMIT_START,
 	TRANS_STATE_COMMIT_DOING,
 	TRANS_STATE_UNBLOCKED,
+	TRANS_STATE_SUPER_COMMITTED,
 	TRANS_STATE_COMPLETED,
 	TRANS_STATE_MAX,
 };
@@ -133,6 +134,7 @@  struct btrfs_trans_handle {
 	bool can_flush_pending_bgs;
 	bool reloc_reserved;
 	bool dirty;
+	bool in_fsync;
 	struct btrfs_root *root;
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;