
[6/6] btrfs: do not block inode logging for so long during transaction commit

Message ID d9df3c01bd2fbfeddfe205fa229ecea2d7478711.1606305501.git.fdmanana@suse.com (mailing list archive)
State New, archived
Series btrfs: some performance improvements for dbench alike workloads

Commit Message

Filipe Manana Nov. 25, 2020, 12:19 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.

This means that after that and before writing the super blocks, new
transactions can start. However, if any task wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).

Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:

1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers

They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.

So change start_log_trans() to use the root tree's log_mutex to serialize
the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, since by the time we try to sync the log the previous
transaction has already written its super blocks.
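
To picture the effect, here is a minimal userspace analogy (a sketch
only: pthreads with made-up sleeps standing in for the IO, not btrfs
code). The committing transaction still takes the tree_log_mutex early
and holds it until its super blocks are written, while a task doing an
fsync only takes that mutex for its final super block write, so all of
its earlier logging work can overlap with the commit:

/*
 * Sketch only: a userspace analogy of the change described above, not
 * btrfs code. The sleeps stand in for the actual IO.
 * Build with: gcc -o log-overlap log-overlap.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t tree_log_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *transaction_commit(void *arg)
{
        (void)arg;
        /* Unchanged by the patch: the commit takes tree_log_mutex early
         * and holds it until its super blocks are written. */
        pthread_mutex_lock(&tree_log_mutex);
        puts("commit: writing dirty extent buffers (slow)");
        sleep(2);
        puts("commit: writing super blocks");
        sleep(1);
        pthread_mutex_unlock(&tree_log_mutex);
        return NULL;
}

static void *fsync_inode(void *arg)
{
        (void)arg;
        /* With the patch, none of this needs tree_log_mutex anymore, so
         * it overlaps with the commit's extent buffer writes: create the
         * log trees, log the inode, write the log's dirty extent
         * buffers. */
        puts("fsync: logging inode and writing log tree blocks");
        sleep(1);

        /* Only the final super block update waits for the previous
         * transaction commit to finish writing its super blocks. */
        pthread_mutex_lock(&tree_log_mutex);
        puts("fsync: writing super blocks pointing to the new log root");
        pthread_mutex_unlock(&tree_log_mutex);
        return NULL;
}

int main(void)
{
        pthread_t commit_thread, fsync_thread;

        pthread_create(&commit_thread, NULL, transaction_commit, NULL);
        pthread_create(&fsync_thread, NULL, fsync_inode, NULL);
        pthread_join(commit_thread, NULL);
        pthread_join(fsync_thread, NULL);
        return 0;
}

Before the patch, fsync_inode() would have needed the tree_log_mutex
already at start_log_trans(), so it could not even begin logging until
the committing transaction finished writing its super blocks.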

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

The following script that uses dbench was used to measure the impact of
the whole patchset:

  $ cat test-dbench.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd"

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f -m single -d single $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a machine with 12 cores, 64G of RAM, using an NVMe
device and a non-debug kernel configuration (Debian's default).

Before patch set:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    11277211    0.250    85.340
 Close        8283172     0.002     6.479
 Rename        477515     1.935    86.026
 Unlink       2277936     0.770    87.071
 Deltree          256    15.732    81.379
 Mkdir            128     0.003     0.009
 Qpathinfo    10221180    0.056    44.404
 Qfileinfo    1789967     0.002     4.066
 Qfsinfo      1874399     0.003     9.176
 Sfileinfo     918589     0.061    10.247
 Find         3951758     0.341    54.040
 WriteX       5616547     0.047    85.079
 ReadX        17676028    0.005     9.704
 LockX          36704     0.003     1.800
 UnlockX        36704     0.002     0.687
 Flush         790541    14.115   676.236

Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms

After patch set:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    12687926    0.171    86.526
 Close        9320780     0.002     8.063
 Rename        537253     1.444    78.576
 Unlink       2561827     0.559    87.228
 Deltree          374    11.499    73.549
 Mkdir            187     0.003     0.005
 Qpathinfo    11500300    0.061    36.801
 Qfileinfo    2017118     0.002     7.189
 Qfsinfo      2108641     0.003     4.825
 Sfileinfo    1033574     0.008     8.065
 Find         4446553     0.408    47.835
 WriteX       6335667     0.045    84.388
 ReadX        19887312    0.003     9.215
 LockX          41312     0.003     1.394
 UnlockX        41312     0.002     1.425
 Flush         889233    13.014   623.259

Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms

+12.7% throughput, -8.2% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h    |  2 +-
 fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++--------------
 2 files changed, 40 insertions(+), 18 deletions(-)

Comments

Wang Yugui Feb. 2, 2021, 5:38 a.m. UTC | #1
Hi, Filipe Manana

The dbench results with these patches are very good. Thanks a lot.

This is the dbench (synchronous mode) result, and then a question.

command: dbench -s -t 60 -D /btrfs/ 32
mount options: ssd,space_cache=v2
kernel: 5.10.12 + patchset 1 + this patchset
patchset 1:
0001-btrfs-fix-race-causing-unnecessary-inode-logging-dur.patch
0002-btrfs-fix-race-that-results-in-logging-old-extents-d.patch
0003-btrfs-fix-race-that-causes-unnecessary-logging-of-an.patch
0004-btrfs-fix-race-that-makes-inode-logging-fallback-to-.patch
0005-btrfs-fix-race-leading-to-unnecessary-transaction-co.patch
0006-btrfs-do-not-block-inode-logging-for-so-long-during-.patch

We get two types of results, shown below, and result type 1 is not easy
to reproduce now.

Question:
for synchronous mode, is result type 1 perfect?
And is there still some minor issue around the flush to address for
result type 2?


result type 1:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     868942     0.028     3.017
 Close         638536     0.003     0.061
 Rename         36851     0.663     4.000
 Unlink        175182     0.399     5.358
 Qpathinfo     789014     0.014     1.846
 Qfileinfo     137684     0.002     0.047
 Qfsinfo       144241     0.004     0.059
 Sfileinfo      70913     0.008     0.046
 Find          304554     0.057     1.889
** WriteX        429696     3.960  2239.973
 ReadX        1363356     0.005     0.358
 LockX           2836     0.004     0.038
 UnlockX         2836     0.002     0.018
** Flush          60771     0.621     6.794

Throughput 452.385 MB/sec (sync open)  32 clients  32 procs  max_latency=1963.312 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.10.12-4.el7.x86_64


result type 2:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     888943     0.028     2.679
 Close         652765     0.002     0.058
 Rename         37705     0.572     3.962
 Unlink        179713     0.383     3.983
 Qpathinfo     806705     0.014     2.294
 Qfileinfo     140752     0.002     0.125
 Qfsinfo       147909     0.004     0.049
 Sfileinfo      72374     0.008     0.104
 Find          311839     0.058     2.305
** WriteX        439656     3.854  1872.109
 ReadX        1396868     0.005     0.324
 LockX           2910     0.004     0.026
 UnlockX         2910     0.002     0.025
** Flush          62260     0.750  1659.364

Throughput 461.856 MB/sec (sync open)  32 clients  32 procs  max_latency=1872.118 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.10.12-4.el7.x86_64



Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02

> From: Filipe Manana <fdmanana@suse.com>
> 
> Early on during a transaction commit we acquire the tree_log_mutex and
> hold it until after we write the super blocks. But before writing the
> extent buffers dirtied by the transaction and the super blocks we unblock
> the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
> fs_info->running_transaction to NULL.
> 
> This means that after that and before writing the super blocks, new
> transactions can start. However if any transaction wants to log an inode,
> it will block waiting for the transaction commit to write its dirty
> extent buffers and the super blocks because the tree_log_mutex is only
> released after those operations are complete, and starting a new log
> transaction blocks on that mutex (at start_log_trans()).
> 
> Writing the dirty extent buffers and the super blocks can take a very
> significant amount of time to complete, but we could allow the tasks
> wanting to log an inode to proceed with most of their steps:
> 
> 1) create the log trees
> 2) log metadata in the trees
> 3) write their dirty extent buffers
> 
> They only need to wait for the previous transaction commit to complete
> (write its super blocks) before they attempt to write their super blocks,
> otherwise we could end up with a corrupt filesystem after a crash
> 
> So change start_log_trans() to use the root tree's log_mutex to serialize
> for the creation of the log root tree instead of using the tree_log_mutex,
> and make btrfs_sync_log() acquire the tree_log_mutex before writing the
> super blocks. This allows for inode logging to wait much less time when
> there is a previous transaction that is still committing, often not having
> to wait at all, as by the time when we try to sync the log the previous
> transaction already wrote its super blocks.
> 
> This patch belongs to a patch set that is comprised of the following
> patches:
> 
>   btrfs: fix race causing unnecessary inode logging during link and rename
>   btrfs: fix race that results in logging old extents during a fast fsync
>   btrfs: fix race that causes unnecessary logging of ancestor inodes
>   btrfs: fix race that makes inode logging fallback to transaction commit
>   btrfs: fix race leading to unnecessary transaction commit when logging inode
>   btrfs: do not block inode logging for so long during transaction commit
> 
> The following script that uses dbench was used to measure the impact of
> the whole patchset:
> 
>   $ cat test-dbench.sh
>   #!/bin/bash
> 
>   DEV=/dev/nvme0n1
>   MNT=/mnt/btrfs
>   MOUNT_OPTIONS="-o ssd"
> 
>   echo "performance" | \
>       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> 
>   mkfs.btrfs -f -m single -d single $DEV
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   dbench -D $MNT -t 300 64
> 
>   umount $MNT
> 
> The test was run on a machine with 12 cores, 64G of ram, using a NVMe
> device and a non-debug kernel configuration (Debian's default).
> 
> Before patch set:
> 
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX    11277211    0.250    85.340
>  Close        8283172     0.002     6.479
>  Rename        477515     1.935    86.026
>  Unlink       2277936     0.770    87.071
>  Deltree          256    15.732    81.379
>  Mkdir            128     0.003     0.009
>  Qpathinfo    10221180    0.056    44.404
>  Qfileinfo    1789967     0.002     4.066
>  Qfsinfo      1874399     0.003     9.176
>  Sfileinfo     918589     0.061    10.247
>  Find         3951758     0.341    54.040
>  WriteX       5616547     0.047    85.079
>  ReadX        17676028    0.005     9.704
>  LockX          36704     0.003     1.800
>  UnlockX        36704     0.002     0.687
>  Flush         790541    14.115   676.236
> 
> Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms
> 
> After patch set:
> 
> Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX    12687926    0.171    86.526
>  Close        9320780     0.002     8.063
>  Rename        537253     1.444    78.576
>  Unlink       2561827     0.559    87.228
>  Deltree          374    11.499    73.549
>  Mkdir            187     0.003     0.005
>  Qpathinfo    11500300    0.061    36.801
>  Qfileinfo    2017118     0.002     7.189
>  Qfsinfo      2108641     0.003     4.825
>  Sfileinfo    1033574     0.008     8.065
>  Find         4446553     0.408    47.835
>  WriteX       6335667     0.045    84.388
>  ReadX        19887312    0.003     9.215
>  LockX          41312     0.003     1.394
>  UnlockX        41312     0.002     1.425
>  Flush         889233    13.014   623.259
> 
> Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms
> 
> +12.7% throughput, -8.2% max latency
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ctree.h    |  2 +-
>  fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++--------------
>  2 files changed, 40 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index c0c6e79c43f9..7185384f475a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1026,7 +1026,7 @@ enum {
>  	BTRFS_ROOT_DEAD_RELOC_TREE,
>  	/* Mark dead root stored on device whose cleanup needs to be resumed */
>  	BTRFS_ROOT_DEAD_TREE,
> -	/* The root has a log tree. Used only for subvolume roots. */
> +	/* The root has a log tree. Used for subvolume roots and the tree root. */
>  	BTRFS_ROOT_HAS_LOG_TREE,
>  	/* Qgroup flushing is in progress */
>  	BTRFS_ROOT_QGROUP_FLUSHING,
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index bc5b652f4f64..e8b84543d565 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -139,8 +139,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>  			   struct btrfs_log_ctx *ctx)
>  {
>  	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_root *tree_root = fs_info->tree_root;
>  	int ret = 0;
>  
> +	/*
> +	 * First check if the log root tree was already created. If not, create
> +	 * it before locking the root's log_mutex, just to keep lockdep happy.
> +	 */
> +	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
> +		mutex_lock(&tree_root->log_mutex);
> +		if (!fs_info->log_root_tree) {
> +			ret = btrfs_init_log_root_tree(trans, fs_info);
> +			if (!ret)
> +				set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
> +		}
> +		mutex_unlock(&tree_root->log_mutex);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	mutex_lock(&root->log_mutex);
>  
>  	if (root->log_root) {
> @@ -156,13 +173,6 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>  			set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>  		}
>  	} else {
> -		mutex_lock(&fs_info->tree_log_mutex);
> -		if (!fs_info->log_root_tree)
> -			ret = btrfs_init_log_root_tree(trans, fs_info);
> -		mutex_unlock(&fs_info->tree_log_mutex);
> -		if (ret)
> -			goto out;
> -
>  		ret = btrfs_add_log_tree(trans, root);
>  		if (ret)
>  			goto out;
> @@ -3022,6 +3032,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	int log_transid = 0;
>  	struct btrfs_log_ctx root_log_ctx;
>  	struct blk_plug plug;
> +	u64 log_root_start;
> +	u64 log_root_level;
>  
>  	mutex_lock(&root->log_mutex);
>  	log_transid = ctx->log_transid;
> @@ -3199,22 +3211,31 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		goto out_wake_log_root;
>  	}
>  
> -	btrfs_set_super_log_root(fs_info->super_for_commit,
> -				 log_root_tree->node->start);
> -	btrfs_set_super_log_root_level(fs_info->super_for_commit,
> -				       btrfs_header_level(log_root_tree->node));
> -
> +	log_root_start = log_root_tree->node->start;
> +	log_root_level = btrfs_header_level(log_root_tree->node);
>  	log_root_tree->log_transid++;
>  	mutex_unlock(&log_root_tree->log_mutex);
>  
>  	/*
> -	 * Nobody else is going to jump in and write the ctree
> -	 * super here because the log_commit atomic below is protecting
> -	 * us.  We must be called with a transaction handle pinning
> -	 * the running transaction open, so a full commit can't hop
> -	 * in and cause problems either.
> +	 * Here we are guaranteed that nobody is going to write the superblock
> +	 * for the current transaction before us and that neither we do write
> +	 * our superblock before the previous transaction finishes its commit
> +	 * and writes its superblock, because:
> +	 *
> +	 * 1) We are holding a handle on the current transaction, so no body
> +	 *    can commit it until we release the handle;
> +	 *
> +	 * 2) Before writing our superblock we acquire the tree_log_mutex, so
> +	 *    if the previous transaction is still committing, and hasn't yet
> +	 *    written its superblock, we wait for it to do it, because a
> +	 *    transaction commit acquires the tree_log_mutex when the commit
> +	 *    begins and releases it only after writing its superblock.
>  	 */
> +	mutex_lock(&fs_info->tree_log_mutex);
> +	btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start);
> +	btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level);
>  	ret = write_all_supers(fs_info, 1);
> +	mutex_unlock(&fs_info->tree_log_mutex);
>  	if (ret) {
>  		btrfs_set_log_full_commit(trans);
>  		btrfs_abort_transaction(trans, ret);
> @@ -3299,6 +3320,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
>  	if (fs_info->log_root_tree) {
>  		free_log_tree(trans, fs_info->log_root_tree);
>  		fs_info->log_root_tree = NULL;
> +		clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state);
>  	}
>  	return 0;
>  }
> -- 
> 2.28.0
Filipe Manana Feb. 2, 2021, 11:28 a.m. UTC | #2
On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi, Filipe Manana
>
> The dbench result with these patches is very good. thanks a lot.
>
> This is the dbench(synchronous mode) result , and then a question.
>
> command: dbench -s -t 60 -D /btrfs/ 32
> mount option:ssd,space_cache=v2
> kernel:5.10.12 + patchset 1 + this patchset

patchset 1 and "this patchset" are the same. Did you mean two
different patchsets or just a single patchset?

> patchset 1:
> 0001-btrfs-fix-race-causing-unnecessary-inode-logging-dur.patch
> 0002-btrfs-fix-race-that-results-in-logging-old-extents-d.patch
> 0003-btrfs-fix-race-that-causes-unnecessary-logging-of-an.patch
> 0004-btrfs-fix-race-that-makes-inode-logging-fallback-to-.patch
> 0005-btrfs-fix-race-leading-to-unnecessary-transaction-co.patch
> 0006-btrfs-do-not-block-inode-logging-for-so-long-during-.patch
>
> We get two types of result as below, and the result type 1 is not easy
> to reproduce now.

Ok, there are often outliers; that's why multiple tests should be done.

>
> Question:
> for synchronous mode, the result type 1 is perfect?

What do you mean by perfect? You mean if result 1 is better than result 2?

> and there is still some minor place about the flush to do for
> the result type2?

By "minor place" you mean the huge difference I suppose.

>
>
> result type 1:
>
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX     868942     0.028     3.017
>  Close         638536     0.003     0.061
>  Rename         36851     0.663     4.000
>  Unlink        175182     0.399     5.358
>  Qpathinfo     789014     0.014     1.846
>  Qfileinfo     137684     0.002     0.047
>  Qfsinfo       144241     0.004     0.059
>  Sfileinfo      70913     0.008     0.046
>  Find          304554     0.057     1.889
> ** WriteX        429696     3.960  2239.973
>  ReadX        1363356     0.005     0.358
>  LockX           2836     0.004     0.038
>  UnlockX         2836     0.002     0.018
> ** Flush          60771     0.621     6.794
>
> Throughput 452.385 MB/sec (sync open)  32 clients  32 procs  max_latency=1963.312 ms
> + stat -f -c %T /btrfs/
> btrfs
> + uname -r
> 5.10.12-4.el7.x86_64
>
>
> result type 2:
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX     888943     0.028     2.679
>  Close         652765     0.002     0.058
>  Rename         37705     0.572     3.962
>  Unlink        179713     0.383     3.983
>  Qpathinfo     806705     0.014     2.294
>  Qfileinfo     140752     0.002     0.125
>  Qfsinfo       147909     0.004     0.049
>  Sfileinfo      72374     0.008     0.104
>  Find          311839     0.058     2.305
> ** WriteX        439656     3.854  1872.109
>  ReadX        1396868     0.005     0.324
>  LockX           2910     0.004     0.026
>  UnlockX         2910     0.002     0.025
> ** Flush          62260     0.750  1659.364
>
> Throughput 461.856 MB/sec (sync open)  32 clients  32 procs  max_latency=1872.118 ms
> + stat -f -c %T /btrfs/
> btrfs
> + uname -r
> 5.10.12-4.el7.x86_64

I'm not sure what your question is exactly.

Are both results after applying the same patchset, or are they before
and after applying the patchset, respectively?

If they are both with the patchset applied and you are wondering about
the big variation in the "Flush" operations, I am not sure why it is
so.
Both throughput and max latency are better in result 2.

It's normal to have variations across dbench runs, I get them too, and
I do several runs (5 or 6) to check things out.

I don't use virtualization (I test on bare metal), I set the CPU
governor to performance (instead of the "powersave" default), and I use
a non-debug kernel configuration, because otherwise I get significant
variations in latencies and throughput too (though I never got a huge
difference such as from 6.794 to 1659.364).

Thanks.

>
>
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/02
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Early on during a transaction commit we acquire the tree_log_mutex and
> > hold it until after we write the super blocks. But before writing the
> > extent buffers dirtied by the transaction and the super blocks we unblock
> > the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
> > fs_info->running_transaction to NULL.
> >
> > This means that after that and before writing the super blocks, new
> > transactions can start. However if any transaction wants to log an inode,
> > it will block waiting for the transaction commit to write its dirty
> > extent buffers and the super blocks because the tree_log_mutex is only
> > released after those operations are complete, and starting a new log
> > transaction blocks on that mutex (at start_log_trans()).
> >
> > Writing the dirty extent buffers and the super blocks can take a very
> > significant amount of time to complete, but we could allow the tasks
> > wanting to log an inode to proceed with most of their steps:
> >
> > 1) create the log trees
> > 2) log metadata in the trees
> > 3) write their dirty extent buffers
> >
> > They only need to wait for the previous transaction commit to complete
> > (write its super blocks) before they attempt to write their super blocks,
> > otherwise we could end up with a corrupt filesystem after a crash
> >
> > So change start_log_trans() to use the root tree's log_mutex to serialize
> > for the creation of the log root tree instead of using the tree_log_mutex,
> > and make btrfs_sync_log() acquire the tree_log_mutex before writing the
> > super blocks. This allows for inode logging to wait much less time when
> > there is a previous transaction that is still committing, often not having
> > to wait at all, as by the time when we try to sync the log the previous
> > transaction already wrote its super blocks.
> >
> > This patch belongs to a patch set that is comprised of the following
> > patches:
> >
> >   btrfs: fix race causing unnecessary inode logging during link and rename
> >   btrfs: fix race that results in logging old extents during a fast fsync
> >   btrfs: fix race that causes unnecessary logging of ancestor inodes
> >   btrfs: fix race that makes inode logging fallback to transaction commit
> >   btrfs: fix race leading to unnecessary transaction commit when logging inode
> >   btrfs: do not block inode logging for so long during transaction commit
> >
> > The following script that uses dbench was used to measure the impact of
> > the whole patchset:
> >
> >   $ cat test-dbench.sh
> >   #!/bin/bash
> >
> >   DEV=/dev/nvme0n1
> >   MNT=/mnt/btrfs
> >   MOUNT_OPTIONS="-o ssd"
> >
> >   echo "performance" | \
> >       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> >
> >   mkfs.btrfs -f -m single -d single $DEV
> >   mount $MOUNT_OPTIONS $DEV $MNT
> >
> >   dbench -D $MNT -t 300 64
> >
> >   umount $MNT
> >
> > The test was run on a machine with 12 cores, 64G of ram, using a NVMe
> > device and a non-debug kernel configuration (Debian's default).
> >
> > Before patch set:
> >
> >  Operation      Count    AvgLat    MaxLat
> >  ----------------------------------------
> >  NTCreateX    11277211    0.250    85.340
> >  Close        8283172     0.002     6.479
> >  Rename        477515     1.935    86.026
> >  Unlink       2277936     0.770    87.071
> >  Deltree          256    15.732    81.379
> >  Mkdir            128     0.003     0.009
> >  Qpathinfo    10221180    0.056    44.404
> >  Qfileinfo    1789967     0.002     4.066
> >  Qfsinfo      1874399     0.003     9.176
> >  Sfileinfo     918589     0.061    10.247
> >  Find         3951758     0.341    54.040
> >  WriteX       5616547     0.047    85.079
> >  ReadX        17676028    0.005     9.704
> >  LockX          36704     0.003     1.800
> >  UnlockX        36704     0.002     0.687
> >  Flush         790541    14.115   676.236
> >
> > Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms
> >
> > After patch set:
> >
> > Operation      Count    AvgLat    MaxLat
> >  ----------------------------------------
> >  NTCreateX    12687926    0.171    86.526
> >  Close        9320780     0.002     8.063
> >  Rename        537253     1.444    78.576
> >  Unlink       2561827     0.559    87.228
> >  Deltree          374    11.499    73.549
> >  Mkdir            187     0.003     0.005
> >  Qpathinfo    11500300    0.061    36.801
> >  Qfileinfo    2017118     0.002     7.189
> >  Qfsinfo      2108641     0.003     4.825
> >  Sfileinfo    1033574     0.008     8.065
> >  Find         4446553     0.408    47.835
> >  WriteX       6335667     0.045    84.388
> >  ReadX        19887312    0.003     9.215
> >  LockX          41312     0.003     1.394
> >  UnlockX        41312     0.002     1.425
> >  Flush         889233    13.014   623.259
> >
> > Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms
> >
> > +12.7% throughput, -8.2% max latency
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/ctree.h    |  2 +-
> >  fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++--------------
> >  2 files changed, 40 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index c0c6e79c43f9..7185384f475a 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1026,7 +1026,7 @@ enum {
> >       BTRFS_ROOT_DEAD_RELOC_TREE,
> >       /* Mark dead root stored on device whose cleanup needs to be resumed */
> >       BTRFS_ROOT_DEAD_TREE,
> > -     /* The root has a log tree. Used only for subvolume roots. */
> > +     /* The root has a log tree. Used for subvolume roots and the tree root. */
> >       BTRFS_ROOT_HAS_LOG_TREE,
> >       /* Qgroup flushing is in progress */
> >       BTRFS_ROOT_QGROUP_FLUSHING,
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index bc5b652f4f64..e8b84543d565 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -139,8 +139,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >                          struct btrfs_log_ctx *ctx)
> >  {
> >       struct btrfs_fs_info *fs_info = root->fs_info;
> > +     struct btrfs_root *tree_root = fs_info->tree_root;
> >       int ret = 0;
> >
> > +     /*
> > +      * First check if the log root tree was already created. If not, create
> > +      * it before locking the root's log_mutex, just to keep lockdep happy.
> > +      */
> > +     if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
> > +             mutex_lock(&tree_root->log_mutex);
> > +             if (!fs_info->log_root_tree) {
> > +                     ret = btrfs_init_log_root_tree(trans, fs_info);
> > +                     if (!ret)
> > +                             set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
> > +             }
> > +             mutex_unlock(&tree_root->log_mutex);
> > +             if (ret)
> > +                     return ret;
> > +     }
> > +
> >       mutex_lock(&root->log_mutex);
> >
> >       if (root->log_root) {
> > @@ -156,13 +173,6 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> >                       set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
> >               }
> >       } else {
> > -             mutex_lock(&fs_info->tree_log_mutex);
> > -             if (!fs_info->log_root_tree)
> > -                     ret = btrfs_init_log_root_tree(trans, fs_info);
> > -             mutex_unlock(&fs_info->tree_log_mutex);
> > -             if (ret)
> > -                     goto out;
> > -
> >               ret = btrfs_add_log_tree(trans, root);
> >               if (ret)
> >                       goto out;
> > @@ -3022,6 +3032,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> >       int log_transid = 0;
> >       struct btrfs_log_ctx root_log_ctx;
> >       struct blk_plug plug;
> > +     u64 log_root_start;
> > +     u64 log_root_level;
> >
> >       mutex_lock(&root->log_mutex);
> >       log_transid = ctx->log_transid;
> > @@ -3199,22 +3211,31 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> >               goto out_wake_log_root;
> >       }
> >
> > -     btrfs_set_super_log_root(fs_info->super_for_commit,
> > -                              log_root_tree->node->start);
> > -     btrfs_set_super_log_root_level(fs_info->super_for_commit,
> > -                                    btrfs_header_level(log_root_tree->node));
> > -
> > +     log_root_start = log_root_tree->node->start;
> > +     log_root_level = btrfs_header_level(log_root_tree->node);
> >       log_root_tree->log_transid++;
> >       mutex_unlock(&log_root_tree->log_mutex);
> >
> >       /*
> > -      * Nobody else is going to jump in and write the ctree
> > -      * super here because the log_commit atomic below is protecting
> > -      * us.  We must be called with a transaction handle pinning
> > -      * the running transaction open, so a full commit can't hop
> > -      * in and cause problems either.
> > +      * Here we are guaranteed that nobody is going to write the superblock
> > +      * for the current transaction before us and that neither we do write
> > +      * our superblock before the previous transaction finishes its commit
> > +      * and writes its superblock, because:
> > +      *
> > +      * 1) We are holding a handle on the current transaction, so no body
> > +      *    can commit it until we release the handle;
> > +      *
> > +      * 2) Before writing our superblock we acquire the tree_log_mutex, so
> > +      *    if the previous transaction is still committing, and hasn't yet
> > +      *    written its superblock, we wait for it to do it, because a
> > +      *    transaction commit acquires the tree_log_mutex when the commit
> > +      *    begins and releases it only after writing its superblock.
> >        */
> > +     mutex_lock(&fs_info->tree_log_mutex);
> > +     btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start);
> > +     btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level);
> >       ret = write_all_supers(fs_info, 1);
> > +     mutex_unlock(&fs_info->tree_log_mutex);
> >       if (ret) {
> >               btrfs_set_log_full_commit(trans);
> >               btrfs_abort_transaction(trans, ret);
> > @@ -3299,6 +3320,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
> >       if (fs_info->log_root_tree) {
> >               free_log_tree(trans, fs_info->log_root_tree);
> >               fs_info->log_root_tree = NULL;
> > +             clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state);
> >       }
> >       return 0;
> >  }
> > --
> > 2.28.0
>
>
Wang Yugui Feb. 2, 2021, 1:01 p.m. UTC | #3
Hi, Filipe Manana

> On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> >
> > Hi, Filipe Manana
> >
> > The dbench result with these patches is very good. thanks a lot.
> >
> > This is the dbench(synchronous mode) result , and then a question.
> >
> > command: dbench -s -t 60 -D /btrfs/ 32
> > mount option:ssd,space_cache=v2
> > kernel:5.10.12 + patchset 1 + this patchset
> 
> patchset 1 and "this patchset" are the same, did you mean two
> different patchsets or just a single patchset?

patchset1:
btrfs: some performance improvements for dbench alike workloads

patchset2:
btrfs: more performance improvements for dbench workloads
https://patchwork.kernel.org/project/linux-btrfs/list/?series=422801

I'm sorry that I replied to the wrong patchset.

> 
> >
> > Question:
> > for synchronous mode, the result type 1 is perfect?
> 
> What do you mean by perfect? You mean if result 1 is better than result 2?

In result 1, the MaxLat of Flush in dbench synchronous mode is low, as
expected, at the same level as kernel 5.4.91.

But in result 2, the MaxLat of Flush in dbench synchronous mode is as
big as the WriteX level, even though this is synchronous mode, where
most of the work should already be done before the flush.

> > and there is still some minor place about the flush to do for
> > the result type2?
> 
> By "minor place" you mean the huge difference I suppose.
> 
> >
> >
> > result type 1:
> >
> >  Operation      Count    AvgLat    MaxLat
> >  ----------------------------------------
> >  NTCreateX     868942     0.028     3.017
> >  Close         638536     0.003     0.061
> >  Rename         36851     0.663     4.000
> >  Unlink        175182     0.399     5.358
> >  Qpathinfo     789014     0.014     1.846
> >  Qfileinfo     137684     0.002     0.047
> >  Qfsinfo       144241     0.004     0.059
> >  Sfileinfo      70913     0.008     0.046
> >  Find          304554     0.057     1.889
> > ** WriteX        429696     3.960  2239.973
> >  ReadX        1363356     0.005     0.358
> >  LockX           2836     0.004     0.038
> >  UnlockX         2836     0.002     0.018
> > ** Flush          60771     0.621     6.794
> >
> > Throughput 452.385 MB/sec (sync open)  32 clients  32 procs  max_latency=1963.312 ms
> > + stat -f -c %T /btrfs/
> > btrfs
> > + uname -r
> > 5.10.12-4.el7.x86_64
> >
> >
> > result type 2:
> >  Operation      Count    AvgLat    MaxLat
> >  ----------------------------------------
> >  NTCreateX     888943     0.028     2.679
> >  Close         652765     0.002     0.058
> >  Rename         37705     0.572     3.962
> >  Unlink        179713     0.383     3.983
> >  Qpathinfo     806705     0.014     2.294
> >  Qfileinfo     140752     0.002     0.125
> >  Qfsinfo       147909     0.004     0.049
> >  Sfileinfo      72374     0.008     0.104
> >  Find          311839     0.058     2.305
> > ** WriteX        439656     3.854  1872.109
> >  ReadX        1396868     0.005     0.324
> >  LockX           2910     0.004     0.026
> >  UnlockX         2910     0.002     0.025
> > ** Flush          62260     0.750  1659.364
> >
> > Throughput 461.856 MB/sec (sync open)  32 clients  32 procs  max_latency=1872.118 ms
> > + stat -f -c %T /btrfs/
> > btrfs
> > + uname -r
> > 5.10.12-4.el7.x86_64
> 
> I'm not sure what your question is exactly.
> 
> Are both results after applying the same patchset, or are they before
> and after applying the patchset, respectively?

Both results are after applying the same patchset, on the same server
and the same SAS SSD disk. But the results are not stable, and the
major difference is the MaxLat of Flush.

Server: Dell T7610
CPU: E5-2660 v2 (10 cores, 20 threads) x2
SSD: TOSHIBA PX05SMQ040
Memory: 192G (with ECC)


> If they are both with the patchset applied, and you wonder about the
> big variation in the "Flush" operations, I am not sure about why it is
> so.
> Both throughput and max latency are better in result 2.
> 
> It's normal to have variations across dbench runs, I get them too, and
> I do several runs (5 or 6) to check things out.
> 
> I don't use virtualization (testing on bare metal), I set the cpu
> governor mode to performance (instead of the "powersave" default) and
> use a non-debug kernel configuration, because otherwise I get
> significant variations in latencies and throughput too (though I never
> got a huge difference such as from 6.794 to 1659.364).

This is bare metal (Dell T7610).
The CPU mode is set to performance by the BIOS, and I checked it with
'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'.

Maybe it matters that I used a SAS SSD: its queue depth is 254, smaller
than the 1023 of an NVMe SSD, but that should still be enough for
dbench with 32 threads?


The huge difference in the MaxLat of Flush, such as from 6.794 to
1659.364, is a problem.
It is not easy to reproduce both: sometimes the small value is easy to
reproduce, sometimes the big one.


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02
Wang Yugui Feb. 2, 2021, 2:09 p.m. UTC | #4
Hi, Filipe Manana

Here are some dbench (sync mode) results on the same hardware, but
with different Linux kernels:

4.14.200
Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        225281     5.163    82.143
 Flush          32161     2.250    62.669
Throughput 236.719 MB/sec (sync open)  32 clients  32 procs  max_latency=82.149 ms

4.19.21
Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        118842    10.946   116.345
 Flush          16506     0.115    44.575
Throughput 125.973 MB/sec (sync open)  32 clients  32 procs  max_latency=116.390 ms

4.19.150
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        144509     9.151   117.353
 Flush          20563     0.128    52.014
Throughput 153.707 MB/sec (sync open)  32 clients  32 procs  max_latency=117.379 ms

5.4.91
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        367033     4.377  1908.724
 Flush          52037     0.159    39.871
Throughput 384.554 MB/sec (sync open)  32 clients  32 procs  max_latency=1908.968 ms

5.10.12+patches
Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        429696     3.960  2239.973
 Flush          60771     0.621     6.794
Throughput 452.385 MB/sec (sync open)  32 clients  32 procs  max_latency=1963.312 ms


The MaxLat / AvgLat ratio of WriteX increased from 82.143/5.163 = 15.9
to 2239.973/3.960 = 565.6.

For QoS, can we have an option to tune the MaxLat / AvgLat ratio of
WriteX to less than 100?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02

> Hi, Filipe Manana
> 
> > On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> > >
> > > Hi, Filipe Manana
> > >
> > > The dbench result with these patches is very good. thanks a lot.
> > >
> > > This is the dbench(synchronous mode) result , and then a question.
> > >
> > > command: dbench -s -t 60 -D /btrfs/ 32
> > > mount option:ssd,space_cache=v2
> > > kernel:5.10.12 + patchset 1 + this patchset
> > 
> > patchset 1 and "this patchset" are the same, did you mean two
> > different patchsets or just a single patchset?
> 
> patchset1:
> btrfs: some performance improvements for dbench alike workloads
> 
> patchset2:
> btrfs: more performance improvements for dbench workloads
> https://patchwork.kernel.org/project/linux-btrfs/list/?series=422801
> 
> I'm sorroy that I have replayed to the wrong patchset.
> 
> > 
> > >
> > > Question:
> > > for synchronous mode, the result type 1 is perfect?
> > 
> > What do you mean by perfect? You mean if result 1 is better than result 2?
> 
> In result 1,  the MaxLat of Flush of dbench synchronous mode is fast as
> expected, the same level as  kernel 5.4.91.
> 
> But in result 2, the MaxLat of Flush of dbench synchronous mode is big
> as write level, but this is synchronous mode, most job should be done
> already before flush.
> 
> > > and there is still some minor place about the flush to do for
> > > the result type2?
> > 
> > By "minor place" you mean the huge difference I suppose.
> > 
> > >
> > >
> > > result type 1:
> > >
> > >  Operation      Count    AvgLat    MaxLat
> > >  ----------------------------------------
> > >  NTCreateX     868942     0.028     3.017
> > >  Close         638536     0.003     0.061
> > >  Rename         36851     0.663     4.000
> > >  Unlink        175182     0.399     5.358
> > >  Qpathinfo     789014     0.014     1.846
> > >  Qfileinfo     137684     0.002     0.047
> > >  Qfsinfo       144241     0.004     0.059
> > >  Sfileinfo      70913     0.008     0.046
> > >  Find          304554     0.057     1.889
> > > ** WriteX        429696     3.960  2239.973
> > >  ReadX        1363356     0.005     0.358
> > >  LockX           2836     0.004     0.038
> > >  UnlockX         2836     0.002     0.018
> > > ** Flush          60771     0.621     6.794
> > >
> > > Throughput 452.385 MB/sec (sync open)  32 clients  32 procs  max_latency=1963.312 ms
> > > + stat -f -c %T /btrfs/
> > > btrfs
> > > + uname -r
> > > 5.10.12-4.el7.x86_64
> > >
> > >
> > > result type 2:
> > >  Operation      Count    AvgLat    MaxLat
> > >  ----------------------------------------
> > >  NTCreateX     888943     0.028     2.679
> > >  Close         652765     0.002     0.058
> > >  Rename         37705     0.572     3.962
> > >  Unlink        179713     0.383     3.983
> > >  Qpathinfo     806705     0.014     2.294
> > >  Qfileinfo     140752     0.002     0.125
> > >  Qfsinfo       147909     0.004     0.049
> > >  Sfileinfo      72374     0.008     0.104
> > >  Find          311839     0.058     2.305
> > > ** WriteX        439656     3.854  1872.109
> > >  ReadX        1396868     0.005     0.324
> > >  LockX           2910     0.004     0.026
> > >  UnlockX         2910     0.002     0.025
> > > ** Flush          62260     0.750  1659.364
> > >
> > > Throughput 461.856 MB/sec (sync open)  32 clients  32 procs  max_latency=1872.118 ms
> > > + stat -f -c %T /btrfs/
> > > btrfs
> > > + uname -r
> > > 5.10.12-4.el7.x86_64
> > 
> > I'm not sure what your question is exactly.
> > 
> > Are both results after applying the same patchset, or are they before
> > and after applying the patchset, respectively?
> 
> Both result after applying the same patchset.
> and both on the same server, same SAS SSD disk.
> but the result is not stable, and the major diff is MaxLat of Flush.
> 
> Server:Dell T7610
> CPU: E5-2660 v2(10core 20threads) x2
> SSD:TOSHIBA  PX05SMQ040
> Memory:192G (with ECC)
> 
> 
> > If they are both with the patchset applied, and you wonder about the
> > big variation in the "Flush" operations, I am not sure about why it is
> > so.
> > Both throughput and max latency are better in result 2.
> > 
> > It's normal to have variations across dbench runs, I get them too, and
> > I do several runs (5 or 6) to check things out.
> > 
> > I don't use virtualization (testing on bare metal), I set the cpu
> > governor mode to performance (instead of the "powersave" default) and
> > use a non-debug kernel configuration, because otherwise I get
> > significant variations in latencies and throughput too (though I never
> > got a huge difference such as from 6.794 to 1659.364).
> 
> This is a bare metal(dell T7610).
> CPU mode is set to performance by BIOS. and I checked it by
> 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
> 
> Maybe I used a SAS ssd, and the queue depth of SAS SSD is 254.
> smaller than 1023 of a NVMe SSD,but it is still enough for
> dbench 32 threads?
> 
> 
> The huge difference of MaxLat of Flush such as from 6.794 to 1659.364 is
> a problem.
> It is not easy to re-product both,  mabye easy to reproduce the small
> one, maybe easy to reproduce the big one.
> 
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/02
>
Filipe Manana Feb. 3, 2021, 10:51 a.m. UTC | #5
On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi, Filipe Manana
>
> There are some dbench(sync mode) result on the same hardware,
> but with different linux kernel
>
> 4.14.200
> Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  WriteX        225281     5.163    82.143
>  Flush          32161     2.250    62.669
> Throughput 236.719 MB/sec (sync open)  32 clients  32 procs  max_latency=82.149 ms
>
> 4.19.21
> Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  WriteX        118842    10.946   116.345
>  Flush          16506     0.115    44.575
> Throughput 125.973 MB/sec (sync open)  32 clients  32 procs  max_latency=116.390 ms
>
> 4.19.150
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  WriteX        144509     9.151   117.353
>  lush          20563     0.128    52.014
> Throughput 153.707 MB/sec (sync open)  32 clients  32 procs  max_latency=117.379 ms
>
> 5.4.91
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  WriteX        367033     4.377  1908.724
>  Flush          52037     0.159    39.871
> Throughput 384.554 MB/sec (sync open)  32 clients  32 procs  max_latency=1908.968 ms

Ok, it seems that somewhere between 4.19 and 5.4 something made the
latency much worse, for you at least.

Is it only when using sync open (O_SYNC, dbench's -s flag)? What about
when not using it?

I'll have to look at it, but it will likely take some time.

Thanks.

>
> 5.10.12+patches
> Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  WriteX        429696     3.960  2239.973
>  Flush          60771     0.621     6.794
> Throughput 452.385 MB/sec (sync open)  32 clients  32 procs  max_latency=1963.312 ms
>
>
> MaxLat / AvgLat  of WriteX is increased from 82.143/5.163=15.9  to
> 2239.973/3.960=565.6.
>
> For QoS, can we have an option to tune the value of MaxLat / AvgLat  of
> WriteX to less than 100?
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/02
>
> > Hi, Filipe Manana
> >
> > > On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> > > >
> > > > Hi, Filipe Manana
> > > >
> > > > The dbench result with these patches is very good. thanks a lot.
> > > >
> > > > This is the dbench(synchronous mode) result , and then a question.
> > > >
> > > > command: dbench -s -t 60 -D /btrfs/ 32
> > > > mount option:ssd,space_cache=v2
> > > > kernel:5.10.12 + patchset 1 + this patchset
> > >
> > > patchset 1 and "this patchset" are the same, did you mean two
> > > different patchsets or just a single patchset?
> >
> > patchset1:
> > btrfs: some performance improvements for dbench alike workloads
> >
> > patchset2:
> > btrfs: more performance improvements for dbench workloads
> > https://patchwork.kernel.org/project/linux-btrfs/list/?series=422801
> >
> > I'm sorroy that I have replayed to the wrong patchset.
> >
> > >
> > > >
> > > > Question:
> > > > for synchronous mode, the result type 1 is perfect?
> > >
> > > What do you mean by perfect? You mean if result 1 is better than result 2?
> >
> > In result 1,  the MaxLat of Flush of dbench synchronous mode is fast as
> > expected, the same level as  kernel 5.4.91.
> >
> > But in result 2, the MaxLat of Flush of dbench synchronous mode is big
> > as write level, but this is synchronous mode, most job should be done
> > already before flush.
> >
> > > > and there is still some minor place about the flush to do for
> > > > the result type2?
> > >
> > > By "minor place" you mean the huge difference I suppose.
> > >
> > > >
> > > >
> > > > result type 1:
> > > >
> > > >  Operation      Count    AvgLat    MaxLat
> > > >  ----------------------------------------
> > > >  NTCreateX     868942     0.028     3.017
> > > >  Close         638536     0.003     0.061
> > > >  Rename         36851     0.663     4.000
> > > >  Unlink        175182     0.399     5.358
> > > >  Qpathinfo     789014     0.014     1.846
> > > >  Qfileinfo     137684     0.002     0.047
> > > >  Qfsinfo       144241     0.004     0.059
> > > >  Sfileinfo      70913     0.008     0.046
> > > >  Find          304554     0.057     1.889
> > > > ** WriteX        429696     3.960  2239.973
> > > >  ReadX        1363356     0.005     0.358
> > > >  LockX           2836     0.004     0.038
> > > >  UnlockX         2836     0.002     0.018
> > > > ** Flush          60771     0.621     6.794
> > > >
> > > > Throughput 452.385 MB/sec (sync open)  32 clients  32 procs  max_latency=1963.312 ms
> > > > + stat -f -c %T /btrfs/
> > > > btrfs
> > > > + uname -r
> > > > 5.10.12-4.el7.x86_64
> > > >
> > > >
> > > > result type 2:
> > > >  Operation      Count    AvgLat    MaxLat
> > > >  ----------------------------------------
> > > >  NTCreateX     888943     0.028     2.679
> > > >  Close         652765     0.002     0.058
> > > >  Rename         37705     0.572     3.962
> > > >  Unlink        179713     0.383     3.983
> > > >  Qpathinfo     806705     0.014     2.294
> > > >  Qfileinfo     140752     0.002     0.125
> > > >  Qfsinfo       147909     0.004     0.049
> > > >  Sfileinfo      72374     0.008     0.104
> > > >  Find          311839     0.058     2.305
> > > > ** WriteX        439656     3.854  1872.109
> > > >  ReadX        1396868     0.005     0.324
> > > >  LockX           2910     0.004     0.026
> > > >  UnlockX         2910     0.002     0.025
> > > > ** Flush          62260     0.750  1659.364
> > > >
> > > > Throughput 461.856 MB/sec (sync open)  32 clients  32 procs  max_latency=1872.118 ms
> > > > + stat -f -c %T /btrfs/
> > > > btrfs
> > > > + uname -r
> > > > 5.10.12-4.el7.x86_64
> > >
> > > I'm not sure what your question is exactly.
> > >
> > > Are both results after applying the same patchset, or are they before
> > > and after applying the patchset, respectively?
> >
> > Both result after applying the same patchset.
> > and both on the same server, same SAS SSD disk.
> > but the result is not stable, and the major diff is MaxLat of Flush.
> >
> > Server:Dell T7610
> > CPU: E5-2660 v2(10core 20threads) x2
> > SSD:TOSHIBA  PX05SMQ040
> > Memory:192G (with ECC)
> >
> >
> > > If they are both with the patchset applied, and you wonder about the
> > > big variation in the "Flush" operations, I am not sure about why it is
> > > so.
> > > Both throughput and max latency are better in result 2.
> > >
> > > It's normal to have variations across dbench runs, I get them too, and
> > > I do several runs (5 or 6) to check things out.
> > >
> > > I don't use virtualization (testing on bare metal), I set the cpu
> > > governor mode to performance (instead of the "powersave" default) and
> > > use a non-debug kernel configuration, because otherwise I get
> > > significant variations in latencies and throughput too (though I never
> > > got a huge difference such as from 6.794 to 1659.364).
> >
> > This is a bare metal(dell T7610).
> > CPU mode is set to performance by BIOS. and I checked it by
> > 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
> >
> > Maybe I used a SAS ssd, and the queue depth of SAS SSD is 254.
> > smaller than 1023 of a NVMe SSD,but it is still enough for
> > dbench 32 threads?
> >
> >
> > The huge difference of MaxLat of Flush such as from 6.794 to 1659.364 is
> > a problem.
> > It is not easy to re-product both,  mabye easy to reproduce the small
> > one, maybe easy to reproduce the big one.
> >
> >
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2021/02/02
> >
>
>
Nikolay Borisov Feb. 3, 2021, 11:03 a.m. UTC | #6
On 3.02.21 at 12:51, Filipe Manana wrote:
> On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>>
>> Hi, Filipe Manana
>>
>> There are some dbench(sync mode) result on the same hardware,
>> but with different linux kernel
>>
>> 4.14.200
>> Operation      Count    AvgLat    MaxLat
>>  ----------------------------------------
>>  WriteX        225281     5.163    82.143
>>  Flush          32161     2.250    62.669
>> Throughput 236.719 MB/sec (sync open)  32 clients  32 procs  max_latency=82.149 ms
>>
>> 4.19.21
>> Operation      Count    AvgLat    MaxLat
>>  ----------------------------------------
>>  WriteX        118842    10.946   116.345
>>  Flush          16506     0.115    44.575
>> Throughput 125.973 MB/sec (sync open)  32 clients  32 procs  max_latency=116.390 ms
>>
>> 4.19.150
>>  Operation      Count    AvgLat    MaxLat
>>  ----------------------------------------
>>  WriteX        144509     9.151   117.353
>>  lush          20563     0.128    52.014
>> Throughput 153.707 MB/sec (sync open)  32 clients  32 procs  max_latency=117.379 ms
>>
>> 5.4.91
>>  Operation      Count    AvgLat    MaxLat
>>  ----------------------------------------
>>  WriteX        367033     4.377  1908.724
>>  Flush          52037     0.159    39.871
>> Throughput 384.554 MB/sec (sync open)  32 clients  32 procs  max_latency=1908.968 ms
> 
> Ok, it seems somewhere between 4.19 and 5.4, something made the
> latency much worse for you at least.
> 
> Is it only when using sync open (O_SYNC, dbench's -s flag), what about
> when not using it?
> 
> I'll have to look at it, but it will likely take some time.


This seems like the perf regression I observed starting with kernel 5.0:
essentially, preemptive flushing of metadata was broken for quite some
time, but kernel 5.0 removed a btrfs_end_transaction() call from
should_end_transaction(), which unmasked the issue.

In particular, this should have been fixed by the following commit in
misc-next:

https://github.com/kdave/btrfs-devel/commit/28d7e221e4323a5b98e5d248eb5603ff5206a188

which is part of a larger series of patches. So, Wang, in order to test
this hypothesis, can you re-run those tests with the latest misc-next
branch?

<snip>
Wang Yugui Feb. 3, 2021, 3:16 p.m. UTC | #7
Hi,

Here is the dbench (sync open) test result of misc-next (5.11-rc6 + 81
patches):

1) The MaxLat is changed from the 1900 ms level to the 1000 ms level;
    that is a good improvement.

2) It is OK that NTCreateX/Rename/Unlink/WriteX have the same level of
    MaxLat, because all of them write something to disk.

3) I'm not sure whether the MaxLat of Flush is OK.
    There should be nothing left to write for Flush because of the
    dbench (sync open) mode, but I'm not sure whether some SCSI command
    such as Synchronize Cache is still executed by Flush (see the sketch
    below).
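
For reference, a minimal sketch of what I assume (sync open) means here
(plain C; the path is hypothetical, and I am assuming dbench's -s flag
uses O_SYNC opens and that its Flush operation maps to fsync() on a
local filesystem):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "payload";
        /* Hypothetical path. With a sync open, every write() returns
         * only after the data (and the metadata needed to read it back)
         * reaches stable storage. */
        int fd = open("/btrfs/testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, buf, sizeof(buf)) < 0)
                perror("write");

        /* Assumed: the Flush operation becomes an fsync(). Even though
         * the preceding write was synchronous, on btrfs this can still
         * sync the log and issue a device cache flush, so it is not
         * necessarily free. */
        if (fsync(fd) < 0)
                perror("fsync");
        close(fd);
        unlink("/btrfs/testfile");
        return 0;
}

So even with synchronous writes, whether the Flush latency stays small
may depend on how much log and cache flushing work is still left at
that point.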

Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     924298     0.035   745.523
 Close         678499     0.002     0.093
 Rename         39280     0.597   744.714
 Unlink        187119     0.435   745.547
 Qpathinfo     838558     0.013     0.978
 Qfileinfo     146308     0.002     0.055
 Qfsinfo       154172     0.004     0.108
 Sfileinfo      75423     0.010     0.109
 Find          324478     0.050     1.115
 WriteX        457376     3.665   957.922
 ReadX        1454051     0.005     0.131
 LockX           3032     0.004     0.027
 UnlockX         3032     0.002     0.022
 Flush          64867     0.649   718.676

Throughput 481.002 MB/sec (sync open)  32 clients  32 procs  max_latency=957.929 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.11.0-0.rc6.141.el8.x86_64

detail:
  32     33682   548.70 MB/sec  execute   1 sec  latency 7.024 ms
  32     36692   538.41 MB/sec  execute   2 sec  latency 6.719 ms
  32     39787   544.75 MB/sec  execute   3 sec  latency 6.405 ms
  32     42850   540.09 MB/sec  execute   4 sec  latency 7.717 ms
  32     45872   535.64 MB/sec  execute   5 sec  latency 7.271 ms
  32     48861   524.29 MB/sec  execute   6 sec  latency 6.685 ms
  32     51343   512.54 MB/sec  execute   7 sec  latency 128.666 ms
  32     53366   496.73 MB/sec  execute   8 sec  latency 703.255 ms *1
  32     56428   498.60 MB/sec  execute   9 sec  latency 6.346 ms
  32     59471   498.40 MB/sec  execute  10 sec  latency 9.958 ms
  32     62175   495.68 MB/sec  execute  11 sec  latency 14.270 ms
  32     65143   498.90 MB/sec  execute  12 sec  latency 9.391 ms
  32     68116   502.19 MB/sec  execute  13 sec  latency 5.713 ms
  32     71078   501.45 MB/sec  execute  14 sec  latency 6.235 ms
  32     74030   500.43 MB/sec  execute  15 sec  latency 7.135 ms
  32     76908   500.82 MB/sec  execute  16 sec  latency 7.071 ms
  32     79794   499.61 MB/sec  execute  17 sec  latency 7.556 ms
  32     82791   502.30 MB/sec  execute  18 sec  latency 6.509 ms
  32     85676   502.05 MB/sec  execute  19 sec  latency 6.938 ms
  32     86950   486.44 MB/sec  execute  20 sec  latency 554.015 ms *2
  32     89670   487.60 MB/sec  execute  21 sec  latency 901.490 ms *3
  32     92577   487.64 MB/sec  execute  22 sec  latency 6.715 ms
  32     95528   488.03 MB/sec  execute  23 sec  latency 7.457 ms
  32     98507   488.76 MB/sec  execute  24 sec  latency 7.266 ms
  32    101340   488.76 MB/sec  execute  25 sec  latency 6.699 ms
  32    104331   489.94 MB/sec  execute  26 sec  latency 6.506 ms
  32    107166   490.87 MB/sec  execute  27 sec  latency 6.582 ms
  32    110022   490.17 MB/sec  execute  28 sec  latency 7.072 ms
  32    112931   490.34 MB/sec  execute  29 sec  latency 6.484 ms
  32    115658   489.67 MB/sec  execute  30 sec  latency 6.767 ms
  32    118569   490.14 MB/sec  execute  31 sec  latency 6.825 ms
  32    121334   490.34 MB/sec  execute  32 sec  latency 7.270 ms
  32    124182   489.95 MB/sec  execute  33 sec  latency 6.849 ms
  32    127073   490.05 MB/sec  execute  34 sec  latency 6.934 ms
  32    129835   489.56 MB/sec  execute  35 sec  latency 7.455 ms
  32    130952   481.58 MB/sec  execute  36 sec  latency 635.676 ms *4
  32    133736   481.74 MB/sec  execute  37 sec  latency 957.929 ms *5
  32    136646   481.79 MB/sec  execute  38 sec  latency 7.339 ms
  32    139616   483.23 MB/sec  execute  39 sec  latency 7.199 ms
  32    142526   483.62 MB/sec  execute  40 sec  latency 7.344 ms
  32    145429   483.58 MB/sec  execute  41 sec  latency 6.967 ms
  32    148329   484.09 MB/sec  execute  42 sec  latency 8.043 ms
  32    151091   483.89 MB/sec  execute  43 sec  latency 7.476 ms
  32    153913   484.33 MB/sec  execute  44 sec  latency 7.611 ms
  32    156679   484.29 MB/sec  execute  45 sec  latency 7.612 ms
  32    159534   483.90 MB/sec  execute  46 sec  latency 8.295 ms
  32    162328   484.17 MB/sec  execute  47 sec  latency 6.582 ms
  32    165080   483.64 MB/sec  execute  48 sec  latency 8.939 ms
  32    167861   484.12 MB/sec  execute  49 sec  latency 6.684 ms
  32    170616   483.56 MB/sec  execute  50 sec  latency 7.051 ms
  32    173557   483.89 MB/sec  execute  51 sec  latency 6.923 ms
  32    176424   484.52 MB/sec  execute  52 sec  latency 6.689 ms
  32    179255   484.14 MB/sec  execute  53 sec  latency 7.973 ms
  32    181195   481.47 MB/sec  execute  54 sec  latency 305.495 ms
  32    183309   479.62 MB/sec  execute  55 sec  latency 866.862 ms *6
  32    186256   479.82 MB/sec  execute  56 sec  latency 7.016 ms
  32    189209   480.82 MB/sec  execute  57 sec  latency 6.789 ms
  32    192072   480.93 MB/sec  execute  58 sec  latency 7.305 ms
  32    195054   481.00 MB/sec  execute  59 sec  latency 7.432 ms


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/03

> 
> 
> On 3.02.21 г. 12:51 ч., Filipe Manana wrote:
> > On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
> >>
> >> Hi, Filipe Manana
> >>
> >> There are some dbench(sync mode) result on the same hardware,
> >> but with different linux kernel
> >>
> >> 4.14.200
> >> Operation      Count    AvgLat    MaxLat
> >>  ----------------------------------------
> >>  WriteX        225281     5.163    82.143
> >>  Flush          32161     2.250    62.669
> >> Throughput 236.719 MB/sec (sync open)  32 clients  32 procs  max_latency=82.149 ms
> >>
> >> 4.19.21
> >> Operation      Count    AvgLat    MaxLat
> >>  ----------------------------------------
> >>  WriteX        118842    10.946   116.345
> >>  Flush          16506     0.115    44.575
> >> Throughput 125.973 MB/sec (sync open)  32 clients  32 procs  max_latency=116.390 ms
> >>
> >> 4.19.150
> >>  Operation      Count    AvgLat    MaxLat
> >>  ----------------------------------------
> >>  WriteX        144509     9.151   117.353
> >>  Flush          20563     0.128    52.014
> >> Throughput 153.707 MB/sec (sync open)  32 clients  32 procs  max_latency=117.379 ms
> >>
> >> 5.4.91
> >>  Operation      Count    AvgLat    MaxLat
> >>  ----------------------------------------
> >>  WriteX        367033     4.377  1908.724
> >>  Flush          52037     0.159    39.871
> >> Throughput 384.554 MB/sec (sync open)  32 clients  32 procs  max_latency=1908.968 ms
> > 
> > Ok, it seems somewhere between 4.19 and 5.4, something made the
> > latency much worse for you at least.
> > 
> > Is it only when using sync open (O_SYNC, dbench's -s flag), what about
> > when not using it?
> > 
> > I'll have to look at it, but it will likely take some time.
> 
> 
> This seems like the perf regression I observed starting with kernel 5.0,
> essentially preemptive flush of metadata was broken for quite some time,
> but kernel 5.0 removed a btrfs_end_transaction call from
> should_end_transaction which unmasked the issue.
> 
> In particular this should have been fixed by the following commit in
> misc-next:
> 
> https://github.com/kdave/btrfs-devel/commit/28d7e221e4323a5b98e5d248eb5603ff5206a188
> which is part of a larger series of patches. So Wang, in order to test
> this hypothesis can you re-run those tests with the latest misc-next
> branch .
> 
> <snip>
Wang Yugui Feb. 4, 2021, 3:17 a.m. UTC | #8
Hi,

I tried to run btrfs misc-next (5.11-rc6 + 81 patches) rebased onto Linux
LTS 5.10.12, with the other kernel components and the kernel config kept
the same.

The dbench (sync open) result is better on both throughput and max_latency.

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     958611     0.026   440.564
 Close         703418     0.002     0.085
 Rename         40599     0.518   439.825
 Unlink        194095     0.282   439.792
 Qpathinfo     869295     0.010     1.852
 Qfileinfo     151285     0.002     0.095
 Qfsinfo       159492     0.004     0.051
 Sfileinfo      78078     0.010     0.104
 Find          335905     0.036     3.043
 WriteX        473353     3.637   713.039
 ReadX        1502197     0.005     0.752
 LockX           3100     0.004     0.024
 UnlockX         3100     0.002     0.020
 Flush          67264     0.575   363.550

Throughput 497.551 MB/sec (sync open)  32 clients  32 procs  max_latency=713.045 ms

detail:
  32     34620   506.62 MB/sec  execute   1 sec  latency 6.860 ms
  32     37878   528.91 MB/sec  execute   2 sec  latency 6.366 ms
  32     41015   541.17 MB/sec  execute   3 sec  latency 6.326 ms
  32     43981   539.10 MB/sec  execute   4 sec  latency 6.743 ms
  32     47111   540.92 MB/sec  execute   5 sec  latency 7.597 ms
  32     50163   544.44 MB/sec  execute   6 sec  latency 6.984 ms
  32     52952   532.23 MB/sec  execute   7 sec  latency 63.769 ms
  32     55561   514.29 MB/sec  execute   8 sec  latency 482.822 ms
  32     58415   515.72 MB/sec  execute   9 sec  latency 6.238 ms
  32     61365   515.94 MB/sec  execute  10 sec  latency 8.102 ms
  32     64309   518.10 MB/sec  execute  11 sec  latency 6.176 ms
  32     67322   517.59 MB/sec  execute  12 sec  latency 6.690 ms
  32     70343   515.28 MB/sec  execute  13 sec  latency 6.831 ms
  32     73323   512.81 MB/sec  execute  14 sec  latency 7.754 ms
  32     76151   513.38 MB/sec  execute  15 sec  latency 7.275 ms
  32     79042   513.44 MB/sec  execute  16 sec  latency 6.678 ms
  32     81995   514.71 MB/sec  execute  17 sec  latency 8.393 ms
  32     84911   512.80 MB/sec  execute  18 sec  latency 7.574 ms
  32     87884   510.68 MB/sec  execute  19 sec  latency 8.363 ms
  32     89711   502.50 MB/sec  execute  20 sec  latency 695.270 ms *1
  32     92628   503.68 MB/sec  execute  21 sec  latency 7.405 ms
  32     95627   503.66 MB/sec  execute  22 sec  latency 6.545 ms
  32     98638   502.90 MB/sec  execute  23 sec  latency 6.649 ms
  32    101641   503.29 MB/sec  execute  24 sec  latency 7.740 ms
  32    104515   503.72 MB/sec  execute  25 sec  latency 6.941 ms
  32    107448   504.28 MB/sec  execute  26 sec  latency 6.704 ms
  32    110321   504.79 MB/sec  execute  27 sec  latency 6.216 ms
  32    113235   503.77 MB/sec  execute  28 sec  latency 7.366 ms
  32    116185   503.18 MB/sec  execute  29 sec  latency 7.153 ms
  32    119074   503.35 MB/sec  execute  30 sec  latency 7.026 ms
  32    121970   503.51 MB/sec  execute  31 sec  latency 8.789 ms
  32    123948   498.76 MB/sec  execute  32 sec  latency 539.597 ms *2
  32    127045   499.14 MB/sec  execute  33 sec  latency 6.112 ms
  32    130059   499.66 MB/sec  execute  34 sec  latency 7.762 ms
  32    133062   500.45 MB/sec  execute  35 sec  latency 6.706 ms
  32    136014   501.03 MB/sec  execute  36 sec  latency 8.384 ms
  32    139018   502.13 MB/sec  execute  37 sec  latency 6.822 ms
  32    142047   502.50 MB/sec  execute  38 sec  latency 6.924 ms
  32    144935   501.74 MB/sec  execute  39 sec  latency 7.319 ms
  32    147836   501.38 MB/sec  execute  40 sec  latency 6.958 ms
  32    150677   501.36 MB/sec  execute  41 sec  latency 6.993 ms
  32    153562   501.63 MB/sec  execute  42 sec  latency 7.126 ms
  32    156385   501.63 MB/sec  execute  43 sec  latency 7.602 ms
  32    159258   500.99 MB/sec  execute  44 sec  latency 7.281 ms
  32    162248   501.05 MB/sec  execute  45 sec  latency 6.896 ms
  32    165128   501.12 MB/sec  execute  46 sec  latency 7.134 ms
  32    168070   501.39 MB/sec  execute  47 sec  latency 6.674 ms
  32    169978   497.76 MB/sec  execute  48 sec  latency 355.037 ms
  32    172936   497.92 MB/sec  execute  49 sec  latency 713.045 ms *3
  32    175828   497.93 MB/sec  execute  50 sec  latency 6.585 ms
  32    178731   498.06 MB/sec  execute  51 sec  latency 8.511 ms
  32    181681   498.79 MB/sec  execute  52 sec  latency 7.093 ms
  32    184622   498.83 MB/sec  execute  53 sec  latency 6.265 ms
  32    187647   498.90 MB/sec  execute  54 sec  latency 6.902 ms
  32    190125   497.59 MB/sec  execute  55 sec  latency 142.628 ms
  32    192920   497.36 MB/sec  execute  56 sec  latency 432.224 ms
  32    195822   497.85 MB/sec  execute  57 sec  latency 6.497 ms
  32    198709   497.56 MB/sec  execute  58 sec  latency 6.418 ms
  32    201665   497.55 MB/sec  execute  59 sec  latency 6.699 ms

Because a kernel interface changed between 5.10 and 5.11, some
modifications to btrfs misc-next (5.11-rc6 + 81 patches) were needed to run
it on Linux LTS 5.10.12 (a sketch of this kind of backport follows the
list):
# drop 0001-block-add-a-bdev_kobj-helper.patch
# drop  0001-block-remove-i_bdev.patch
# add  0001-btrfs-bdev_nr_sectors.patch
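
As a rough sketch of this kind of backport (the patch file names are those
listed above; the directory holding the exported misc-next btrfs patches is
hypothetical, and the exact workflow used here may have differed):

  # Start from the LTS base.
  git checkout -b btrfs-misc-next-5.10 v5.10.12
  # Add the block helper that 5.10 lacks.
  git am 0001-block-add-a-bdev_kobj-helper.patch
  # Apply the exported misc-next btrfs patches, skipping
  # 0001-block-remove-i_bdev.patch, then add the compatibility fix.
  git am misc-next-btrfs/*.patch
  git am 0001-btrfs-bdev_nr_sectors.patch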

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04

> Hi,
> 
> There is the dbench(sync open) test result of misc-next(5.11-rc6 +81patches)
> 
> 1, the MaxLat is changed from 1900ms level to 1000ms level.
>     that is a good improvement.
> 
> 2, It is OK that NTCreateX/Rename/Unlink/WriteX have the same level of
>     MaxLat because all of them will write something to disk.
> 
> 3, I'm not sure whether MaxLat of Flush is OK.
>     there should be nothing to write for Flush because of 
>     dbench(sync open) mode. but I'm not sure whether some
>     scsi comand such as Synchronize Cache will be executed by Flush.
> 
> Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX     924298     0.035   745.523
>  Close         678499     0.002     0.093
>  Rename         39280     0.597   744.714
>  Unlink        187119     0.435   745.547
>  Qpathinfo     838558     0.013     0.978
>  Qfileinfo     146308     0.002     0.055
>  Qfsinfo       154172     0.004     0.108
>  Sfileinfo      75423     0.010     0.109
>  Find          324478     0.050     1.115
>  WriteX        457376     3.665   957.922
>  ReadX        1454051     0.005     0.131
>  LockX           3032     0.004     0.027
>  UnlockX         3032     0.002     0.022
>  Flush          64867     0.649   718.676
> 
> Throughput 481.002 MB/sec (sync open)  32 clients  32 procs  max_latency=957.929 ms
> + stat -f -c %T /btrfs/
> btrfs
> + uname -r
> 5.11.0-0.rc6.141.el8.x86_64
> 
> <snip>
> 
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/03
> 
> > 
> > 
> > On 3.02.21 г. 12:51 ч., Filipe Manana wrote:
> > > On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
> > >>
> > >> Hi, Filipe Manana
> > >>
> > >> There are some dbench(sync mode) result on the same hardware,
> > >> but with different linux kernel
> > >>
> > >> 4.14.200
> > >> Operation      Count    AvgLat    MaxLat
> > >>  ----------------------------------------
> > >>  WriteX        225281     5.163    82.143
> > >>  Flush          32161     2.250    62.669
> > >> Throughput 236.719 MB/sec (sync open)  32 clients  32 procs  max_latency=82.149 ms
> > >>
> > >> 4.19.21
> > >> Operation      Count    AvgLat    MaxLat
> > >>  ----------------------------------------
> > >>  WriteX        118842    10.946   116.345
> > >>  Flush          16506     0.115    44.575
> > >> Throughput 125.973 MB/sec (sync open)  32 clients  32 procs  max_latency=116.390 ms
> > >>
> > >> 4.19.150
> > >>  Operation      Count    AvgLat    MaxLat
> > >>  ----------------------------------------
> > >>  WriteX        144509     9.151   117.353
> > >>  Flush          20563     0.128    52.014
> > >> Throughput 153.707 MB/sec (sync open)  32 clients  32 procs  max_latency=117.379 ms
> > >>
> > >> 5.4.91
> > >>  Operation      Count    AvgLat    MaxLat
> > >>  ----------------------------------------
> > >>  WriteX        367033     4.377  1908.724
> > >>  Flush          52037     0.159    39.871
> > >> Throughput 384.554 MB/sec (sync open)  32 clients  32 procs  max_latency=1908.968 ms
> > > 
> > > Ok, it seems somewhere between 4.19 and 5.4, something made the
> > > latency much worse for you at least.
> > > 
> > > Is it only when using sync open (O_SYNC, dbench's -s flag), what about
> > > when not using it?
> > > 
> > > I'll have to look at it, but it will likely take some time.
> > 
> > 
> > This seems like the perf regression I observed starting with kernel 5.0,
> > essentially preemptive flush of metadata was broken for quite some time,
> > but kernel 5.0 removed a btrfs_end_transaction call from
> > should_end_transaction which unmasked the issue.
> > 
> > In particular this should have been fixed by the following commit in
> > misc-next:
> > 
> > https://github.com/kdave/btrfs-devel/commit/28d7e221e4323a5b98e5d248eb5603ff5206a188
> > which is part of a larger series of patches. So Wang, in order to test
> > this hypothesis can you re-run those tests with the latest misc-next
> > branch .
> > 
> > <snip>
>
Nikolay Borisov Feb. 4, 2021, 6:19 a.m. UTC | #9
On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> Hi,
> 
> I tried to run btrfs misc-next(5.11-rc6 +81patches) based on linux LTS
> 5.10.12 with the same other kernel components and the same kernel config.
> 
> Better dbench(sync open) result on both Throughput and max_latency.
> 

If I understand correctly, you rebased current misc-next onto 5.10.12. If
so, this means there is something else in the main kernel, not btrfs
related, which degrades performance; it seems you've got a 300ms win by
running on 5.10 compared to the 5.11-rc6-based misc-next, is that right?

<snip>
>> Hi,
>>
>> There is the dbench(sync open) test result of misc-next(5.11-rc6 +81patches)
>>
>> 1, the MaxLat is changed from 1900ms level to 1000ms level.
>>     that is a good improvement.
>>
>> 2, It is OK that NTCreateX/Rename/Unlink/WriteX have the same level of
>>     MaxLat because all of them will write something to disk.
>>
>> 3, I'm not sure whether MaxLat of Flush is OK.
>>     there should be nothing to write for Flush because of 
>>     dbench(sync open) mode. but I'm not sure whether some
>>     scsi comand such as Synchronize Cache will be executed by Flush.
>>
>> Operation      Count    AvgLat    MaxLat
>>  ----------------------------------------
>>  NTCreateX     924298     0.035   745.523
>>  Close         678499     0.002     0.093
>>  Rename         39280     0.597   744.714
>>  Unlink        187119     0.435   745.547
>>  Qpathinfo     838558     0.013     0.978
>>  Qfileinfo     146308     0.002     0.055
>>  Qfsinfo       154172     0.004     0.108
>>  Sfileinfo      75423     0.010     0.109
>>  Find          324478     0.050     1.115
>>  WriteX        457376     3.665   957.922
>>  ReadX        1454051     0.005     0.131
>>  LockX           3032     0.004     0.027
>>  UnlockX         3032     0.002     0.022
>>  Flush          64867     0.649   718.676
>>
>> Throughput 481.002 MB/sec (sync open)  32 clients  32 procs  max_latency=957.929 ms
>> + stat -f -c %T /btrfs/
>> btrfs
>> + uname -r
>> 5.11.0-0.rc6.141.el8.x86_64
>>
>> <snip>
>>
>>
>> Best Regards
>> Wang Yugui (wangyugui@e16-tech.com)
>> 2021/02/03
>>
>>>
>>>
>>> On 3.02.21 г. 12:51 ч., Filipe Manana wrote:
>>>> On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>>>>>
>>>>> Hi, Filipe Manana
>>>>>
>>>>> There are some dbench(sync mode) result on the same hardware,
>>>>> but with different linux kernel
>>>>>
>>>>> 4.14.200
>>>>> Operation      Count    AvgLat    MaxLat
>>>>>  ----------------------------------------
>>>>>  WriteX        225281     5.163    82.143
>>>>>  Flush          32161     2.250    62.669
>>>>> Throughput 236.719 MB/sec (sync open)  32 clients  32 procs  max_latency=82.149 ms
>>>>>
>>>>> 4.19.21
>>>>> Operation      Count    AvgLat    MaxLat
>>>>>  ----------------------------------------
>>>>>  WriteX        118842    10.946   116.345
>>>>>  Flush          16506     0.115    44.575
>>>>> Throughput 125.973 MB/sec (sync open)  32 clients  32 procs  max_latency=116.390 ms
>>>>>
>>>>> 4.19.150
>>>>>  Operation      Count    AvgLat    MaxLat
>>>>>  ----------------------------------------
>>>>>  WriteX        144509     9.151   117.353
>>>>>  Flush          20563     0.128    52.014
>>>>> Throughput 153.707 MB/sec (sync open)  32 clients  32 procs  max_latency=117.379 ms
>>>>>
>>>>> 5.4.91
>>>>>  Operation      Count    AvgLat    MaxLat
>>>>>  ----------------------------------------
>>>>>  WriteX        367033     4.377  1908.724
>>>>>  Flush          52037     0.159    39.871
>>>>> Throughput 384.554 MB/sec (sync open)  32 clients  32 procs  max_latency=1908.968 ms
>>>>
>>>> Ok, it seems somewhere between 4.19 and 5.4, something made the
>>>> latency much worse for you at least.
>>>>
>>>> Is it only when using sync open (O_SYNC, dbench's -s flag), what about
>>>> when not using it?
>>>>
>>>> I'll have to look at it, but it will likely take some time.
>>>
>>>
>>> This seems like the perf regression I observed starting with kernel 5.0,
>>> essentially preemptive flush of metadata was broken for quite some time,
>>> but kernel 5.0 removed a btrfs_end_transaction call from
>>> should_end_transaction which unmasked the issue.
>>>
>>> In particular this should have been fixed by the following commit in
>>> misc-next:
>>>
>>> https://github.com/kdave/btrfs-devel/commit/28d7e221e4323a5b98e5d248eb5603ff5206a188
>>> which is part of a larger series of patches. So Wang, in order to test
>>> this hypothesis can you re-run those tests with the latest misc-next
>>> branch .
>>>
>>> <snip>
>>
> 
>
Wang Yugui Feb. 4, 2021, 11:34 a.m. UTC | #10
Hi,

> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> > Hi,
> > 
> > I tried to run btrfs misc-next(5.11-rc6 +81patches) based on linux LTS
> > 5.10.12 with the same other kernel components and the same kernel config.
> > 
> > Better dbench(sync open) result on both Throughput and max_latency.
> > 
> 
> If i understand correctly you rebased current misc-next to 5.10.12, if
> so this means there is something else in the main kernel, that's not
> btrfs related which degrades performance, it seems you've got a 300ms
> win by running on 5.10 as compared on 5.11-rc6-based misc next, is that
> right?

Yes.

Maybe some code other than btrfs in the mainline 5.11-rc6 kernel degrades
performance, or maybe it is just because of a different kernel config
(a config comparison sketch follows the links below).
The kernel configs I used:
https://kojipkgs.fedoraproject.org/packages/kernel/5.11.0/0.rc6.141.eln108/
https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/
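
To rule out the config difference, the two configs could be compared
directly. A small sketch, assuming both config files have been extracted
from the packages above (the local file names are placeholders) and using
the kernel tree's scripts/diffconfig helper:

  cd linux    # any kernel source tree
  scripts/diffconfig config-5.10.12-200.fc33 config-5.11.0-0.rc6.141.eln108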

I rebased current misc-next onto 5.10.12, so that the only difference is in
the btrfs source code.

Only 3 minor patches were needed for this rebase; there seems to be no
breaking kernel API change affecting btrfs between 5.10 and 5.11.
# add-to-5.10  0001-block-add-a-bdev_kobj-helper.patch
# drop-from-btrfs-misc-next  0001-block-remove-i_bdev.patch
# fix-to-btrfs-misc-next  	0001-btrfs-bdev_nr_sectors.patch

More patches came into misc-next today; they are not yet rebased/tested.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04
Nikolay Borisov Feb. 4, 2021, 11:58 a.m. UTC | #11
On 4.02.21 г. 13:34 ч., Wang Yugui wrote:
> Hi,
> 
>> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
>>> Hi,
>>>
>>> I tried to run btrfs misc-next(5.11-rc6 +81patches) based on linux LTS
>>> 5.10.12 with the same other kernel components and the same kernel config.
>>>
>>> Better dbench(sync open) result on both Throughput and max_latency.
>>>
>>
>> If i understand correctly you rebased current misc-next to 5.10.12, if
>> so this means there is something else in the main kernel, that's not
>> btrfs related which degrades performance, it seems you've got a 300ms
>> win by running on 5.10 as compared on 5.11-rc6-based misc next, is that
>> right?
> 
> Yes.

I just realized this could also be caused by btrfs code that has already
landed in v5.11-rc1, i.e. the main pull request for this release (a bisect
sketch that could help narrow this down is below).
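
A rough sketch of how that could be narrowed down with a bisect restricted
to the btrfs directory (the build/boot/test cycle and the latency threshold
for judging a revision are left to the tester; the dbench invocation is the
sync-open workload from this thread):

  git bisect start v5.11-rc1 v5.10 -- fs/btrfs
  # At each step: build and boot the kernel, run the workload, then judge
  # the result by the observed MaxLat, e.g.:
  #   dbench -s -D /mnt/btrfs -t 300 32
  git bisect good      # or: git bisect bad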

> 
> maybye some code rather than btrfs in main kernel 5.11.rc6 have degrades
> performance.
> or maybe just because of different kernel config.
> kernel config I used:
> https://kojipkgs.fedoraproject.org/packages/kernel/5.11.0/0.rc6.141.eln108/
> https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/
> 
> I rebased current misc-next to 5.10.12, so that there is only diff in
> btrfs source code.
> 
> only 3 minor patch needed for this rebase, there seems no broken kernel API
>  change for btrfs between 5.10 and 5.11.
> # add-to-5.10  0001-block-add-a-bdev_kobj-helper.patch
> # drop-from-btrs-misc-next  0001-block-remove-i_bdev.patch
> # fix-to-btrfs-misc-next  	0001-btrfs-bdev_nr_sectors.patch
> 
> more patch come into misc-next today, they are yet not rebased/tested.
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/04
>
Wang Yugui Feb. 5, 2021, 1:47 a.m. UTC | #12
Hi,

> On 4.02.21 г. 13:34 ч., Wang Yugui wrote:
> > Hi,
> > 
> >> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> >>> Hi,
> >>>
> >>> I tried to run btrfs misc-next(5.11-rc6 +81patches) based on linux LTS
> >>> 5.10.12 with the same other kernel components and the same kernel config.
> >>>
> >>> Better dbench(sync open) result on both Throughput and max_latency.
> >>>
> >>
> >> If i understand correctly you rebased current misc-next to 5.10.12, if
> >> so this means there is something else in the main kernel, that's not
> >> btrfs related which degrades performance, it seems you've got a 300ms
> >> win by running on 5.10 as compared on 5.11-rc6-based misc next, is that
> >> right?
> > 
> > Yes.
> 
> I just realized this could also be caused by btrfs code that has already
> landed in v5.11-rc1 for example. I.e the main pull req for this release.

This performance problem seems very complex.

Result of btrfs 5.10.12 plus the following btrfs patchsets, with the rest
of the kernel at 5.10.12:
- 5.11 free-space-tree (13 patches)
- 5.11 btrfs: some performance improvements for dbench alike workloads
- misc-next btrfs: more performance improvements for dbench workloads
- misc-next Improve preemptive ENOSPC flushing

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     858500     0.030     2.969
 Close         630548     0.002     0.071
 Rename         36416     0.744     3.791
 Unlink        173375     0.503     4.109
 Qpathinfo     779355     0.014     2.715
 Qfileinfo     136041     0.002     0.054
 Qfsinfo       142728     0.004     0.084
 Sfileinfo      69953     0.010     0.098
 Find          301093     0.059     2.670
 WriteX        424594     3.999  1765.527
 ReadX        1348726     0.005     0.397
 LockX           2816     0.004     0.022
 UnlockX         2816     0.003     0.025
 Flush          60094     0.764  1478.133

Throughput 445.83 MB/sec (sync open)  32 clients  32 procs  max_latency=1765.539 ms

Still worse than btrfs misc-next (20210203) with the rest of the kernel at
5.10.12:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     958611     0.026   440.564
 Close         703418     0.002     0.085
 Rename         40599     0.518   439.825
 Unlink        194095     0.282   439.792
 Qpathinfo     869295     0.010     1.852
 Qfileinfo     151285     0.002     0.095
 Qfsinfo       159492     0.004     0.051
 Sfileinfo      78078     0.010     0.104
 Find          335905     0.036     3.043
 WriteX        473353     3.637   713.039
 ReadX        1502197     0.005     0.752
 LockX           3100     0.004     0.024
 UnlockX         3100     0.002     0.020
 Flush          67264     0.575   363.550

Throughput 497.551 MB/sec (sync open)  32 clients  32 procs  max_latency=713.045 ms

So some other patch in misc-next improved this performance too.

But MaxLat/AvgLat of WriteX is still 175 in
btrfs misc-next (20210203) with the rest of the kernel at 5.10.12,
so some patch after 5.10 may have introduced a new performance problem too.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04


> > 
> > maybye some code rather than btrfs in main kernel 5.11.rc6 have degrades
> > performance.
> > or maybe just because of different kernel config.
> > kernel config I used:
> > https://kojipkgs.fedoraproject.org/packages/kernel/5.11.0/0.rc6.141.eln108/
> > https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/
> > 
> > I rebased current misc-next to 5.10.12, so that there is only diff in
> > btrfs source code.
> > 
> > only 3 minor patch needed for this rebase, there seems no broken kernel API
> >  change for btrfs between 5.10 and 5.11.
> > # add-to-5.10  0001-block-add-a-bdev_kobj-helper.patch
> > # drop-from-btrs-misc-next  0001-block-remove-i_bdev.patch
> > # fix-to-btrfs-misc-next  	0001-btrfs-bdev_nr_sectors.patch
> > 
> > more patch come into misc-next today, they are yet not rebased/tested.
> > 
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2021/02/04
> >
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c0c6e79c43f9..7185384f475a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1026,7 +1026,7 @@  enum {
 	BTRFS_ROOT_DEAD_RELOC_TREE,
 	/* Mark dead root stored on device whose cleanup needs to be resumed */
 	BTRFS_ROOT_DEAD_TREE,
-	/* The root has a log tree. Used only for subvolume roots. */
+	/* The root has a log tree. Used for subvolume roots and the tree root. */
 	BTRFS_ROOT_HAS_LOG_TREE,
 	/* Qgroup flushing is in progress */
 	BTRFS_ROOT_QGROUP_FLUSHING,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index bc5b652f4f64..e8b84543d565 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -139,8 +139,25 @@  static int start_log_trans(struct btrfs_trans_handle *trans,
 			   struct btrfs_log_ctx *ctx)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_root *tree_root = fs_info->tree_root;
 	int ret = 0;
 
+	/*
+	 * First check if the log root tree was already created. If not, create
+	 * it before locking the root's log_mutex, just to keep lockdep happy.
+	 */
+	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
+		mutex_lock(&tree_root->log_mutex);
+		if (!fs_info->log_root_tree) {
+			ret = btrfs_init_log_root_tree(trans, fs_info);
+			if (!ret)
+				set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
+		}
+		mutex_unlock(&tree_root->log_mutex);
+		if (ret)
+			return ret;
+	}
+
 	mutex_lock(&root->log_mutex);
 
 	if (root->log_root) {
@@ -156,13 +173,6 @@  static int start_log_trans(struct btrfs_trans_handle *trans,
 			set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 		}
 	} else {
-		mutex_lock(&fs_info->tree_log_mutex);
-		if (!fs_info->log_root_tree)
-			ret = btrfs_init_log_root_tree(trans, fs_info);
-		mutex_unlock(&fs_info->tree_log_mutex);
-		if (ret)
-			goto out;
-
 		ret = btrfs_add_log_tree(trans, root);
 		if (ret)
 			goto out;
@@ -3022,6 +3032,8 @@  int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	int log_transid = 0;
 	struct btrfs_log_ctx root_log_ctx;
 	struct blk_plug plug;
+	u64 log_root_start;
+	u64 log_root_level;
 
 	mutex_lock(&root->log_mutex);
 	log_transid = ctx->log_transid;
@@ -3199,22 +3211,31 @@  int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		goto out_wake_log_root;
 	}
 
-	btrfs_set_super_log_root(fs_info->super_for_commit,
-				 log_root_tree->node->start);
-	btrfs_set_super_log_root_level(fs_info->super_for_commit,
-				       btrfs_header_level(log_root_tree->node));
-
+	log_root_start = log_root_tree->node->start;
+	log_root_level = btrfs_header_level(log_root_tree->node);
 	log_root_tree->log_transid++;
 	mutex_unlock(&log_root_tree->log_mutex);
 
 	/*
-	 * Nobody else is going to jump in and write the ctree
-	 * super here because the log_commit atomic below is protecting
-	 * us.  We must be called with a transaction handle pinning
-	 * the running transaction open, so a full commit can't hop
-	 * in and cause problems either.
+	 * Here we are guaranteed that nobody is going to write the superblock
+	 * for the current transaction before us and that we do not write our
+	 * superblock before the previous transaction finishes its commit
+	 * and writes its superblock, because:
+	 *
+	 * 1) We are holding a handle on the current transaction, so nobody
+	 *    can commit it until we release the handle;
+	 *
+	 * 2) Before writing our superblock we acquire the tree_log_mutex, so
+	 *    if the previous transaction is still committing, and hasn't yet
+	 *    written its superblock, we wait for it to do it, because a
+	 *    transaction commit acquires the tree_log_mutex when the commit
+	 *    begins and releases it only after writing its superblock.
 	 */
+	mutex_lock(&fs_info->tree_log_mutex);
+	btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start);
+	btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level);
 	ret = write_all_supers(fs_info, 1);
+	mutex_unlock(&fs_info->tree_log_mutex);
 	if (ret) {
 		btrfs_set_log_full_commit(trans);
 		btrfs_abort_transaction(trans, ret);
@@ -3299,6 +3320,7 @@  int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
 	if (fs_info->log_root_tree) {
 		free_log_tree(trans, fs_info->log_root_tree);
 		fs_info->log_root_tree = NULL;
+		clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state);
 	}
 	return 0;
 }