[v3,2/3] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT

Message ID 20200708062447.81341-3-wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: qgroup: Fix the long existing regression of btrfs/153

Commit Message

Qu Wenruo July 8, 2020, 6:24 a.m. UTC
[PROBLEM]
There are known problems related to how btrfs handles qgroup reserved
space.
One of the most obvious cases is the test case btrfs/153, which does
fallocate, then writes into the preallocated range.

  btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
      --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
      +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
      @@ -1,2 +1,5 @@
       QA output created by 153
      +pwrite: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
       Silence is golden
      ...
      (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)

[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.

Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.

For a preallocated extent, we have already reserved qgroup data space,
and since we also reserve qgroup data space at buffered write time,
writing into preallocated space needs twice the space.

This leads to the -EDQUOT in buffered write routine.

And we can't follow the same solution as the data/meta space check:
unlike those, qgroup reserved space is shared between data and metadata.
The -EDQUOT can happen at the metadata reservation, so doing a NODATACOW
check after qgroup reservation failure is not a solution.
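
The double reservation described above can be illustrated with a small
userspace model. This is only a sketch, not btrfs code: struct toy_qgroup,
toy_reserve() and toy_prealloc_then_write() are hypothetical names, and the
model assumes the preallocated bytes are still counted when the buffered
write reserves again.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy qgroup (hypothetical): only a byte limit and the bytes reserved. */
struct toy_qgroup {
	uint64_t limit;
	uint64_t reserved;
};

/* Reserve @bytes, or fail with -EDQUOT if the limit would be exceeded. */
static int toy_reserve(struct toy_qgroup *qg, uint64_t bytes)
{
	if (qg->reserved + bytes > qg->limit)
		return -EDQUOT;
	qg->reserved += bytes;
	return 0;
}

/*
 * Model fallocate() followed by a buffered write into the same range.
 * Both steps reserve @len bytes, so the pair needs 2 * @len in total.
 */
static int toy_prealloc_then_write(struct toy_qgroup *qg, uint64_t len)
{
	/* Reservation made for the preallocated extent. */
	int ret = toy_reserve(qg, len);

	if (ret < 0)
		return ret;
	/* Second reservation made at buffered write time. */
	return toy_reserve(qg, len);
}
```

With a limit of 100 bytes, toy_prealloc_then_write(&qg, 60) fails with
-EDQUOT in this model even though the range itself fits in the limit.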

[FIX]
To solve the problem, we don't return -EDQUOT directly. Instead, every
time we get -EDQUOT, we try to flush qgroup space by:
- Flush all inodes of the root
  NODATACOW writes will free the qgroup reserved space at
  run_delalloc_range().
  However, we don't have the infrastructure to flush only NODATACOW
  inodes, so we flush all inodes.

- Wait ordered extents
  This would convert the preallocated metadata space into per-trans
  metadata, which can be freed in later transaction commit.

- Commit transaction
  This will free all per-trans metadata space.

Also, we don't want concurrent callers to trigger redundant flushes, so
we introduce a per-root mutex: if a qgroup flush is already running, we
wait for it to finish and don't start another one.
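
The reserve, flush, retry flow described above can be sketched in userspace
as follows. This is a simplified model under stated assumptions, not the
kernel implementation: struct toy_root and the toy_*() helpers are
hypothetical, and the "flushable" counter stands in for all the space that
flushing inodes, waiting for ordered extents and committing the transaction
would release.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy root (hypothetical): reservations split by whether a flush frees them. */
struct toy_root {
	uint64_t limit;
	uint64_t reserved;	/* bytes that stay reserved after a flush */
	uint64_t flushable;	/* reserved bytes that a flush would release */
	int flush_count;	/* how many flushes were triggered */
};

/* Plain reservation: fail with -EDQUOT if the limit would be exceeded. */
static int toy_reserve(struct toy_root *root, uint64_t bytes)
{
	if (root->reserved + root->flushable + bytes > root->limit)
		return -EDQUOT;
	root->reserved += bytes;
	return 0;
}

/* Stand-in for the flush: release everything a flush can free. */
static int toy_flush(struct toy_root *root)
{
	root->flushable = 0;
	root->flush_count++;
	return 0;
}

/* Reserve; on -EDQUOT flush once and retry once. */
static int toy_reserve_with_flush(struct toy_root *root, uint64_t bytes)
{
	int ret = toy_reserve(root, bytes);

	if (ret != -EDQUOT)
		return ret;	/* success, or an error a flush cannot fix */

	ret = toy_flush(root);
	if (ret < 0)
		return ret;
	/* Single retry: a second -EDQUOT is returned to the caller. */
	return toy_reserve(root, bytes);
}
```

The retry happens only once, so a caller that is genuinely over quota still
gets -EDQUOT, just after one extra flush.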

Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ctree.h   |   1 +
 fs/btrfs/disk-io.c |   1 +
 fs/btrfs/qgroup.c  | 101 +++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 95 insertions(+), 8 deletions(-)

Comments

Josef Bacik July 8, 2020, 2:10 p.m. UTC | #1
Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
David Sterba July 9, 2020, 4:32 p.m. UTC | #2
On Wed, Jul 08, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
> -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
> +static int try_flush_qgroup(struct btrfs_root *root)
> +{
> +	struct btrfs_trans_handle *trans;
> +	int ret;
> +
> +	/*
> +	 * We don't want to run flush again and again, so if there is a running
> +	 * one, we won't try to start a new flush, but exit directly.
> +	 */
> +	ret = mutex_trylock(&root->qgroup_flushing_mutex);
> +	if (!ret) {
> +		mutex_lock(&root->qgroup_flushing_mutex);
> +		mutex_unlock(&root->qgroup_flushing_mutex);

This is abuse of a mutex: it is used both for tracking the status "is
somebody flushing" and for waiting until the flush is over.

We have many root::state bits (the BTRFS_ROOT_* namespace), so "qgroups
are flushing" should be another one. The bit is atomically set when
flushing starts and cleared when it ends.

All waiting tasks should queue in a normal wait_queue_head.

> +		return 0;
> +	}
> +
> +	ret = btrfs_start_delalloc_snapshot(root);
> +	if (ret < 0)
> +		goto unlock;
> +	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
> +
> +	trans = btrfs_join_transaction(root);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		goto unlock;
> +	}
> +
> +	ret = btrfs_commit_transaction(trans);
> +unlock:
> +	mutex_unlock(&root->qgroup_flushing_mutex);

And also the whole wait/join/commit combo is in one huge mutex, that's
really an anti-pattern.
David Sterba July 9, 2020, 5:40 p.m. UTC | #3
On Thu, Jul 09, 2020 at 06:32:46PM +0200, David Sterba wrote:
> On Wed, Jul 08, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
> > -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
> > +static int try_flush_qgroup(struct btrfs_root *root)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	int ret;
> > +
> > +	/*
> > +	 * We don't want to run flush again and again, so if there is a running
> > +	 * one, we won't try to start a new flush, but exit directly.
> > +	 */
> > +	ret = mutex_trylock(&root->qgroup_flushing_mutex);
> > +	if (!ret) {
> > +		mutex_lock(&root->qgroup_flushing_mutex);
> > +		mutex_unlock(&root->qgroup_flushing_mutex);
> 
> This is abuse of a mutex: it is used both for tracking the status "is
> somebody flushing" and for waiting until the flush is over.
> 
> We have many root::state bits (the BTRFS_ROOT_* namespace), so "qgroups
> are flushing" should be another one. The bit is atomically set when
> flushing starts and cleared when it ends.
> 
> All waiting tasks should queue in a normal wait_queue_head.

Something like that (untested):

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cc1fcaa7cfa..5dbb6b7300b7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1007,6 +1007,12 @@ enum {
 	BTRFS_ROOT_DEAD_TREE,
 	/* The root has a log tree. Used only for subvolume roots. */
 	BTRFS_ROOT_HAS_LOG_TREE,
+
+	/*
+	 * Indicate that qgroup flushing is in progress to prevent multiple
+	 * processes attempting that
+	 */
+	BTRFS_ROOT_QGROUP_FLUSHING,
 };
 
 /*
@@ -1159,7 +1165,7 @@ struct btrfs_root {
 	spinlock_t qgroup_meta_rsv_lock;
 	u64 qgroup_meta_rsv_pertrans;
 	u64 qgroup_meta_rsv_prealloc;
-	struct mutex qgroup_flushing_mutex;
+	wait_queue_head_t qgroup_flush_wait;
 
 	/* Number of active swapfiles */
 	atomic_t nr_swapfiles;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8029127df537..e124e3376208 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1116,7 +1116,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
-	mutex_init(&root->qgroup_flushing_mutex);
+	init_waitqueue_head(&root->qgroup_flush_wait);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
 	init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 494ab2b1bbf2..95695aca7aa9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3503,10 +3503,9 @@ static int try_flush_qgroup(struct btrfs_root *root)
 	 * We don't want to run flush again and again, so if there is a running
 	 * one, we won't try to start a new flush, but exit directly.
 	 */
-	ret = mutex_trylock(&root->qgroup_flushing_mutex);
-	if (!ret) {
-		mutex_lock(&root->qgroup_flushing_mutex);
-		mutex_unlock(&root->qgroup_flushing_mutex);
+	if (test_and_set_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)) {
+		wait_event(root->qgroup_flush_wait,
+			!test_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state));
 		return 0;
 	}
 
@@ -3523,7 +3522,8 @@ static int try_flush_qgroup(struct btrfs_root *root)
 
 	ret = btrfs_commit_transaction(trans);
 unlock:
-	mutex_unlock(&root->qgroup_flushing_mutex);
+	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
+	wake_up(&root->qgroup_flush_wait);
 	return ret;
 }
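
For readers less familiar with kernel wait queues, the flag plus wait queue
idea can be sketched in userspace with pthreads. This is only an analogue
under stated assumptions, not the proposed kernel code: struct toy_root and
toy_try_flush() are hypothetical, a bool guarded by a small mutex stands in
for the BTRFS_ROOT_QGROUP_FLUSHING bit, and a condition variable stands in
for the wait_queue_head. Note that waiters have to be woken explicitly once
the flag is cleared.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct toy_root {
	pthread_mutex_t lock;		/* protects @flushing only */
	pthread_cond_t flush_wait;	/* stand-in for the wait queue */
	bool flushing;			/* stand-in for the status bit */
	int flush_count;		/* how many flushes actually ran */
};

static int toy_try_flush(struct toy_root *root)
{
	pthread_mutex_lock(&root->lock);
	if (root->flushing) {
		/* Someone else is flushing: wait for them, then return. */
		while (root->flushing)
			pthread_cond_wait(&root->flush_wait, &root->lock);
		pthread_mutex_unlock(&root->lock);
		return 0;
	}
	root->flushing = true;
	pthread_mutex_unlock(&root->lock);

	/* Stand-in for flush + wait + commit; no lock is held here. */
	root->flush_count++;

	pthread_mutex_lock(&root->lock);
	root->flushing = false;
	/* Wake every waiter, the counterpart of waking the wait queue. */
	pthread_cond_broadcast(&root->flush_wait);
	pthread_mutex_unlock(&root->lock);
	return 0;
}
```

The small mutex here is held only while the flag is examined or changed,
never across the slow flush itself.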
Qu Wenruo July 9, 2020, 11:06 p.m. UTC | #4
On 2020/7/10 12:32 AM, David Sterba wrote:
> On Wed, Jul 08, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
>> -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
>> +static int try_flush_qgroup(struct btrfs_root *root)
>> +{
>> +	struct btrfs_trans_handle *trans;
>> +	int ret;
>> +
>> +	/*
>> +	 * We don't want to run flush again and again, so if there is a running
>> +	 * one, we won't try to start a new flush, but exit directly.
>> +	 */
>> +	ret = mutex_trylock(&root->qgroup_flushing_mutex);
>> +	if (!ret) {
>> +		mutex_lock(&root->qgroup_flushing_mutex);
>> +		mutex_unlock(&root->qgroup_flushing_mutex);
> 
> This is abuse of a mutex: it is used both for tracking the status "is
> somebody flushing" and for waiting until the flush is over.
> 
> We have many root::state bits (the BTRFS_ROOT_* namespace), so "qgroups
> are flushing" should be another one. The bit is atomically set when
> flushing starts and cleared when it ends.

In fact, I want to avoid plain wait_queue usage if possible.

Unlike a mutex, a wait_queue doesn't have a good debugging mechanism
such as lockdep.

I see no benefit in re-implementing the existing mutex code ourselves
here.

It may look like an abuse of a mutex, but I could wrap it into something
like wait_or_lock_mutex(), which may slightly improve readability.

Or am I missing anything?

> 
> All waiting tasks should queue in a normal wait_queue_head.
> 
>> +		return 0;
>> +	}
>> +
>> +	ret = btrfs_start_delalloc_snapshot(root);
>> +	if (ret < 0)
>> +		goto unlock;
>> +	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>> +
>> +	trans = btrfs_join_transaction(root);
>> +	if (IS_ERR(trans)) {
>> +		ret = PTR_ERR(trans);
>> +		goto unlock;
>> +	}
>> +
>> +	ret = btrfs_commit_transaction(trans);
>> +unlock:
>> +	mutex_unlock(&root->qgroup_flushing_mutex);
> 
> And also the whole wait/join/commit combo is in one huge mutex, that's
> really an anti-pattern.
> 
But that mutex is per-root, and this is the slow path.

Converting to a wait_queue won't change the pattern either.

Thanks,
Qu
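
The wait_or_lock_mutex() wrapper mentioned above does not exist in the
patch; what follows is a hedged userspace sketch of what such a helper
might look like, using pthreads in place of the kernel mutex API. The
return convention (true means "lock acquired", false means "waited for the
other holder") and the toy_*() names are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return true with @lock held if nobody else held it.
 * Otherwise wait until the current holder releases it, then return false
 * with @lock NOT held.
 */
static bool wait_or_lock_mutex(pthread_mutex_t *lock)
{
	if (pthread_mutex_trylock(lock) == 0)
		return true;
	/* Block until the holder is done, then drop the lock right away. */
	pthread_mutex_lock(lock);
	pthread_mutex_unlock(lock);
	return false;
}

struct toy_root {
	pthread_mutex_t flushing_mutex;
	int flush_count;	/* how many flushes actually ran */
};

static int toy_try_flush(struct toy_root *root)
{
	if (!wait_or_lock_mutex(&root->flushing_mutex))
		return 0;	/* another caller just flushed for us */

	/* Stand-in for flush + wait + commit, done under the mutex. */
	root->flush_count++;

	pthread_mutex_unlock(&root->flushing_mutex);
	return 0;
}
```

Compared with the open-coded trylock/lock/unlock sequence, the wrapper only
changes readability; the flush still runs with the mutex held.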
Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4dd478b4fe3a..891f47c7891f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1162,6 +1162,7 @@  struct btrfs_root {
 	spinlock_t qgroup_meta_rsv_lock;
 	u64 qgroup_meta_rsv_pertrans;
 	u64 qgroup_meta_rsv_prealloc;
+	struct mutex qgroup_flushing_mutex;
 
 	/* Number of active swapfiles */
 	atomic_t nr_swapfiles;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c27022f13150..0116e0b487c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1116,6 +1116,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
+	mutex_init(&root->qgroup_flushing_mutex);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
 	init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 84a452dea3f9..207eb52f9d80 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3513,17 +3513,59 @@  static int qgroup_revert(struct btrfs_inode *inode,
 }
 
 /*
- * Reserve qgroup space for range [start, start + len).
+ * Try to free some space for qgroup.
  *
- * This function will either reserve space from related qgroups or doing
- * nothing if the range is already reserved.
+ * For qgroup, there are only 3 ways to free qgroup space:
+ * - Flush nodatacow write
+ *   Any nodatacow write will free its reserved data space at
+ *   run_delalloc_range().
+ *   In theory, we should only flush nodatacow inodes, but it's
+ *   not yet possible, so we need to flush the whole root.
  *
- * Return 0 for successful reserve
- * Return <0 for error (including -EQUOT)
+ * - Wait for ordered extents
+ *   When ordered extents are finished, their reserved metadata
+ *   is finally converted to per_trans status, which can be freed
+ *   by a later transaction commit.
  *
- * NOTE: this function may sleep for memory allocation.
+ * - Commit transaction
+ *   This would free the meta_per_trans space.
+ *   In theory this shouldn't provide much space, but we need every bit of
+ *   qgroup space we can get.
  */
-int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+static int try_flush_qgroup(struct btrfs_root *root)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	/*
+	 * We don't want to run flush again and again, so if there is a running
+	 * one, we won't try to start a new flush, but exit directly.
+	 */
+	ret = mutex_trylock(&root->qgroup_flushing_mutex);
+	if (!ret) {
+		mutex_lock(&root->qgroup_flushing_mutex);
+		mutex_unlock(&root->qgroup_flushing_mutex);
+		return 0;
+	}
+
+	ret = btrfs_start_delalloc_snapshot(root);
+	if (ret < 0)
+		goto unlock;
+	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto unlock;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+unlock:
+	mutex_unlock(&root->qgroup_flushing_mutex);
+	return ret;
+}
+
+static int qgroup_reserve_data(struct btrfs_inode *inode,
 			struct extent_changeset **reserved_ret, u64 start,
 			u64 len)
 {
@@ -3576,6 +3618,34 @@  int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Reserve qgroup space for range [start, start + len).
+ *
+ * This function will either reserve space from related qgroups or do
+ * nothing if the range is already reserved.
+ *
+ * Return 0 for successful reserve
+ * Return <0 for error (including -EDQUOT)
+ *
+ * NOTE: This function may sleep for memory allocation, dirty page flushing and
+ *	 transaction commit. So the caller should not hold any dirty page locked.
+ */
+int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+			struct extent_changeset **reserved_ret, u64 start,
+			u64 len)
+{
+	int ret;
+
+	ret = qgroup_reserve_data(inode, reserved_ret, start, len);
+	if (ret <= 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(inode->root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_data(inode, reserved_ret, start, len);
+}
+
 /* Free ranges specified by @reserved, normally in error path */
 static int qgroup_free_reserved_data(struct btrfs_inode *inode,
 			struct extent_changeset *reserved, u64 start, u64 len)
@@ -3744,7 +3814,7 @@  static int sub_root_meta_rsv(struct btrfs_root *root, int num_bytes,
 	return num_bytes;
 }
 
-int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 				enum btrfs_qgroup_rsv_type type, bool enforce)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -3771,6 +3841,21 @@  int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 	return ret;
 }
 
+int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+				enum btrfs_qgroup_rsv_type type, bool enforce)
+{
+	int ret;
+
+	ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
+	if (ret <= 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_meta(root, num_bytes, type, enforce);
+}
+
 void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;