
[2/9] btrfs: improve preemptive background space flushing

Message ID fc525d2a6a15a701d688b4f9f62f23caa51023bb.1601495426.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Series Improve preemptive ENOSPC flushing

Commit Message

Josef Bacik Sept. 30, 2020, 8:01 p.m. UTC
Currently, if we ever have to flush space because we do not have enough,
we allocate a ticket, attach it to the space_info, and then
systematically flush things in the file system that hold space
reservations until our space is reclaimed.

However this has a latency cost: we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.

To address that we used to kick off the async worker to flush space
preemptively, so that hopefully we would already be reclaiming space
before any tasks needed to stop and wait for space to be reclaimed.

When I introduced the ticketed ENOSPC stuff this broke slightly, in that
we were using tickets to indicate whether we were done flushing.  No
tickets, no more flushing.  However this meant that we essentially never
preemptively flushed.  This caused a write performance regression that
Nikolay noticed in an unrelated patch that removed the committing of the
transaction during btrfs_end_transaction.

Before that patch, btrfs_end_transaction() would see that we were low on
space and commit the transaction.  This was bad because in this
particular case you could end up with thousands and thousands of
transaction commits during the 5 minute reproducer.  With the patch
removing that behavior you got a much saner number of transaction
commits, but we ended up slower because we would write for a while,
flush, write for a while, flush again.

To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketed flushing in that it
doesn't have tickets to base its decisions on.  Instead of bolting this
logic onto our existing flushing work, add another worker to handle this
preemptive flushing.  Here we attempt to be slightly intelligent about
what we flush, flushing whichever reservation pool is currently taking
up the most space.
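
Concretely, the per-iteration decision boils down to something like the
following (a condensed sketch of the logic in the patch below; locking,
the pinned-bytes commit shortcut and the surrounding loop are omitted):

	/* Estimate how much of bytes_may_use is tied up in delalloc. */
	block_rsv_size = global_rsv->reserved + delayed_block_rsv->reserved +
			 delayed_refs_rsv->reserved + trans_rsv->reserved;
	if (block_rsv_size < space_info->bytes_may_use)
		delalloc_size = space_info->bytes_may_use - block_rsv_size;

	/* The global rsv is unreclaimable, leave it out of the comparison. */
	block_rsv_size -= global_rsv->reserved;

	/* Flush from whichever pool is currently the largest. */
	if (delalloc_size > block_rsv_size) {
		to_reclaim = delalloc_size;
		flush = FLUSH_DELALLOC;
	} else if (delayed_block_rsv->reserved > delayed_refs_rsv->reserved) {
		to_reclaim = delayed_block_rsv->reserved;
		flush = FLUSH_DELAYED_ITEMS_NR;
	} else {
		to_reclaim = delayed_refs_rsv->reserved;
		flush = FLUSH_DELAYED_REFS_NR;
	}

	/* Only reclaim a quarter of the chosen pool per pass. */
	to_reclaim >>= 2;
	if (!to_reclaim)
		to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
	flush_space(fs_info, space_info, to_reclaim, flush);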

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h      |   1 +
 fs/btrfs/disk-io.c    |   1 +
 fs/btrfs/space-info.c | 122 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 122 insertions(+), 2 deletions(-)

Comments

Nikolay Borisov Oct. 1, 2020, 1:19 p.m. UTC | #1
On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
<snip>

> When I introduced the ticketed ENOSPC stuff this broke slightly, in that
> we were using tickets to indicate whether we were done flushing.  No
> tickets, no more flushing.  However this meant that we essentially never
> preemptively flushed.  This caused a write performance regression that
> Nikolay noticed in an unrelated patch that removed the committing of the
> transaction during btrfs_end_transaction.

I see, so basically the patch which I bisected this to was really
papering over the initial bug, since the logic in end_transaction sort
of simulated pre-emptive flushing... how subtle!

<snip>

> +	spin_lock(&space_info->lock);
> +	used = btrfs_space_info_used(space_info, true);
> +	while (need_do_async_reclaim(fs_info, space_info, used)) {
> +		enum btrfs_reserve_flush_enum flush;
> +		u64 delalloc_size = 0;
> +		u64 to_reclaim, block_rsv_size;
> +		u64 global_rsv_size = global_rsv->reserved;
> +
> +		/*
> +		 * If we're just full of pinned, commit the transaction.  We
> +		 * don't call flush_space(COMMIT_TRANS) here because that has
> +		 * logic to decide whether we need to commit the transaction to
> +		 * satisfy the ticket to keep us from live locking the box by
> +		 * committing over and over again.  Here we don't care about
> +		 * that, we know we are using a lot of space and most of it is
> +		 * pinned, just commit.

nit: That comment is a mouthful.  I think what you are describing here
is really this line in may_commit_transaction:

if (!bytes_needed) return 0;

Which triggers if we don't have a ticket.  If so, simply say there:

"We can't call flush_commit because it will flush iff there is a pending
ticket".

<snip>

> +		/*
> +		 * We don't have a precise counter for delalloc, so we'll
> +		 * approximate it by subtracting out the block rsv's space from
> +		 * the bytes_may_use.  If that amount is higher than the
> +		 * individual reserves, then we can assume it's tied up in
> +		 * delalloc reservations.
> +		 */
> +		block_rsv_size = global_rsv_size +
> +			delayed_block_rsv->reserved +
> +			delayed_refs_rsv->reserved +
> +			trans_rsv->reserved;
> +		if (block_rsv_size < space_info->bytes_may_use)
> +			delalloc_size = space_info->bytes_may_use -
> +				block_rsv_size;

What about:

percpu_counter_add_batch(&fs_info->delalloc_bytes, len,
                      fs_info->delalloc_batch);

> +		spin_unlock(&space_info->lock);
> +

<snip>
Josef Bacik Oct. 1, 2020, 9:35 p.m. UTC | #2
On 10/1/20 9:19 AM, Nikolay Borisov wrote:
> 
> 
> On 30.09.20 г. 23:01 ч., Josef Bacik wrote:
> <snip>
> 
>> When I introduced the ticketed ENOSPC stuff this broke slightly, in that
>> we were using tickets to indicate whether we were done flushing.  No
>> tickets, no more flushing.  However this meant that we essentially never
>> preemptively flushed.  This caused a write performance regression that
>> Nikolay noticed in an unrelated patch that removed the committing of the
>> transaction during btrfs_end_transaction.
> 
> I see, so basically the patch which I bisected this to was really
> papering over the initial bug, since the logic in end_transaction sort
> of simulated pre-emptive flushing... how subtle!
> 
> <snip>
> 
>> +	spin_lock(&space_info->lock);
>> +	used = btrfs_space_info_used(space_info, true);
>> +	while (need_do_async_reclaim(fs_info, space_info, used)) {
>> +		enum btrfs_reserve_flush_enum flush;
>> +		u64 delalloc_size = 0;
>> +		u64 to_reclaim, block_rsv_size;
>> +		u64 global_rsv_size = global_rsv->reserved;
>> +
>> +		/*
>> +		 * If we're just full of pinned, commit the transaction.  We
>> +		 * don't call flush_space(COMMIT_TRANS) here because that has
>> +		 * logic to decide whether we need to commit the transaction to
>> +		 * satisfy the ticket to keep us from live locking the box by
>> +		 * committing over and over again.  Here we don't care about
>> +		 * that, we know we are using a lot of space and most of it is
>> +		 * pinned, just commit.
> 
> nit: That comment is a mouthful.  I think what you are describing here
> is really this line in may_commit_transaction:
> 
> if (!bytes_needed) return 0;
> 
> Which triggers if we don't have a ticket.  If so, simply say there:
> 
> "We can't call flush_commit because it will flush iff there is a pending
> ticket".
> 

Yup, I'll reword.

> <snip>
> 
>> +		/*
>> +		 * We don't have a precise counter for delalloc, so we'll
>> +		 * approximate it by subtracting out the block rsv's space from
>> +		 * the bytes_may_use.  If that amount is higher than the
>> +		 * individual reserves, then we can assume it's tied up in
>> +		 * delalloc reservations.
>> +		 */
>> +		block_rsv_size = global_rsv_size +
>> +			delayed_block_rsv->reserved +
>> +			delayed_refs_rsv->reserved +
>> +			trans_rsv->reserved;
>> +		if (block_rsv_size < space_info->bytes_may_use)
>> +			delalloc_size = space_info->bytes_may_use -
>> +				block_rsv_size;
> 
> What about:
> 
> percpu_counter_add_batch(&fs_info->delalloc_bytes, len,
>                        fs_info->delalloc_batch);
> 

Because that's the total data bytes, not how much metadata is reserved
for the data bytes, and those can be grossly different things.  For
example, one contiguous MAX_EXTENT_SIZE data extent is only around 1MiB
of metadata reservations, but if we used delalloc_bytes here we'd think
the delalloc size was dominating the metadata reservations.  Thanks,

Josef
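
To put rough numbers on that (illustrative only, assuming a
MAX_EXTENT_SIZE of 128MiB and taking the ~1MiB metadata estimate above
at face value): a single fully contiguous 128MiB delalloc range adds
128MiB to fs_info->delalloc_bytes while tying up only about 1MiB of
metadata reservation, so steering the preemptive flusher by
delalloc_bytes would overweight delalloc by roughly two orders of
magnitude, whereas the bytes_may_use minus block-rsv approximation
tracks the metadata that is actually reserved.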

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac3d6f4e35b..483182eaea19 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -922,6 +922,7 @@  struct btrfs_fs_info {
 	/* Used to reclaim the metadata space in the background. */
 	struct work_struct async_reclaim_work;
 	struct work_struct async_data_reclaim_work;
+	struct work_struct preempt_reclaim_work;
 
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3d39f5d47ad3..2d0edc849da9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3999,6 +3999,7 @@  void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
 	cancel_work_sync(&fs_info->async_data_reclaim_work);
+	cancel_work_sync(&fs_info->preempt_reclaim_work);
 
 	/* Cancel or finish ongoing discard work */
 	btrfs_discard_cleanup(fs_info);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 40726c8888f7..024daa843c56 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -987,6 +987,122 @@  static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	} while (flush_state <= COMMIT_TRANS);
 }
 
+/*
+ * This handles pre-flushing of metadata space before we get to the point that
+ * we need to start blocking people on tickets.  The logic here is different
+ * from the other flush paths because it doesn't rely on tickets to tell us how
+ * much we need to flush, instead it attempts to keep us below the 80% full
+ * watermark of space by flushing whichever reservation pool is currently the
+ * largest.
+ */
+static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_space_info *space_info;
+	struct btrfs_block_rsv *delayed_block_rsv;
+	struct btrfs_block_rsv *delayed_refs_rsv;
+	struct btrfs_block_rsv *global_rsv;
+	struct btrfs_block_rsv *trans_rsv;
+	u64 used;
+
+	fs_info = container_of(work, struct btrfs_fs_info,
+			       preempt_reclaim_work);
+	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+	delayed_block_rsv = &fs_info->delayed_block_rsv;
+	delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	global_rsv = &fs_info->global_block_rsv;
+	trans_rsv = &fs_info->trans_block_rsv;
+
+	spin_lock(&space_info->lock);
+	used = btrfs_space_info_used(space_info, true);
+	while (need_do_async_reclaim(fs_info, space_info, used)) {
+		enum btrfs_reserve_flush_enum flush;
+		u64 delalloc_size = 0;
+		u64 to_reclaim, block_rsv_size;
+		u64 global_rsv_size = global_rsv->reserved;
+
+		/*
+		 * If we're just full of pinned, commit the transaction.  We
+		 * don't call flush_space(COMMIT_TRANS) here because that has
+		 * logic to decide whether we need to commit the transaction to
+		 * satisfy the ticket to keep us from live locking the box by
+		 * committing over and over again.  Here we don't care about
+		 * that, we know we are using a lot of space and most of it is
+		 * pinned, just commit.
+		 *
+		 * We do not want to include the global rsv size in the
+		 * bytes_may_use size here because it's unreclaimable.
+		 */
+		to_reclaim = (space_info->bytes_may_use < global_rsv_size) ? 0:
+			space_info->bytes_may_use - global_rsv_size;
+		if (space_info->bytes_pinned >= to_reclaim) {
+			struct btrfs_trans_handle *trans;
+
+			spin_unlock(&space_info->lock);
+			trans = btrfs_join_transaction(fs_info->extent_root);
+			if (IS_ERR(trans))
+				return;
+			btrfs_commit_transaction(trans);
+			goto next;
+		}
+
+		/*
+		 * We don't have a precise counter for delalloc, so we'll
+		 * approximate it by subtracting out the block rsv's space from
+		 * the bytes_may_use.  If that amount is higher than the
+		 * individual reserves, then we can assume it's tied up in
+		 * delalloc reservations.
+		 */
+		block_rsv_size = global_rsv_size +
+			delayed_block_rsv->reserved +
+			delayed_refs_rsv->reserved +
+			trans_rsv->reserved;
+		if (block_rsv_size < space_info->bytes_may_use)
+			delalloc_size = space_info->bytes_may_use -
+				block_rsv_size;
+		spin_unlock(&space_info->lock);
+
+		/*
+		 * We don't want to include the global_rsv in our calculation,
+		 * because that's space we can't touch.  Subtract it from the
+		 * block_rsv_size for the next checks.
+		 */
+		block_rsv_size -= global_rsv_size;
+
+		/*
+		 * We really want to avoid flushing delalloc too much, as it
+		 * could result in poor allocation patterns, so only flush it if
+		 * it's larger than the rest of the pools combined.
+		 */
+		if (delalloc_size > block_rsv_size) {
+			to_reclaim = delalloc_size;
+			flush = FLUSH_DELALLOC;
+		} else if (delayed_block_rsv->reserved >
+			   delayed_refs_rsv->reserved) {
+			to_reclaim = delayed_block_rsv->reserved;
+			flush = FLUSH_DELAYED_ITEMS_NR;
+		} else {
+			to_reclaim = delayed_refs_rsv->reserved;
+			flush = FLUSH_DELAYED_REFS_NR;
+		}
+
+		/*
+		 * We don't want to reclaim everything, just a portion, so scale
+		 * down the to_reclaim by 1/4.  If it takes us down to 0,
+		 * reclaim 1 items worth.
+		 */
+		to_reclaim >>= 2;
+		if (!to_reclaim)
+			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
+		flush_space(fs_info, space_info, to_reclaim, flush);
+next:
+		cond_resched();
+		spin_lock(&space_info->lock);
+		used = btrfs_space_info_used(space_info, true);
+	}
+	spin_unlock(&space_info->lock);
+}
+
 /*
  * FLUSH_DELALLOC_WAIT:
  *   Space is freed from flushing delalloc in one of two ways.
@@ -1113,6 +1229,8 @@  void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info)
 {
 	INIT_WORK(&fs_info->async_reclaim_work, btrfs_async_reclaim_metadata_space);
 	INIT_WORK(&fs_info->async_data_reclaim_work, btrfs_async_reclaim_data_space);
+	INIT_WORK(&fs_info->preempt_reclaim_work,
+		  btrfs_preempt_reclaim_metadata_space);
 }
 
 static const enum btrfs_flush_state priority_flush_states[] = {
@@ -1392,11 +1510,11 @@  static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		 */
 		if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
 		    need_do_async_reclaim(fs_info, space_info, used) &&
-		    !work_busy(&fs_info->async_reclaim_work)) {
+		    !work_busy(&fs_info->preempt_reclaim_work)) {
 			trace_btrfs_trigger_flush(fs_info, space_info->flags,
 						  orig_bytes, flush, "preempt");
 			queue_work(system_unbound_wq,
-				   &fs_info->async_reclaim_work);
+				   &fs_info->preempt_reclaim_work);
 		}
 	}
 	spin_unlock(&space_info->lock);