
[03/10] btrfs: offload all write I/O completions to a workqueue

Message ID 20230314165910.373347-4-hch@lst.de (mailing list archive)
State New, archived
Series [01/10] btrfs: use a plain workqueue for ordered_extent processing

Commit Message

hch@lst.de March 14, 2023, 4:59 p.m. UTC
Currently all reads are offloaded to workqueues for checksum
verification.  Writes are initially processed in the bi_end_io
context that usually happens in hard or soft interrupts, and are
only offloaded to a workqueue for ordered extent completion or
compressed extent handling, and in the future for raid stripe
tree handling.

Switch to offloading all writes to workqueues instead of doing that
only for the final ordered extent processing.  This means slightly
more workqueue offloads for large writes, as each submitted btrfs_bio
can only hold 1MiB worth of data on systems with a 4K page size.  On
the other hand this will allow large simplifications by always having
a process context.

For metadata writes there was no offload at all before, which creates
various problems, such as not being able to drop the final
extent_buffer reference from the end_io handler.

With the entire series applied this shows no performance difference
for data-heavy workloads.  For metadata-heavy workloads like Filipe's
fs_mark test there is also no change in the averages, while the
outliers in terms of best and worst performance seem to be somewhat
reduced.  This is probably due to the removal of the irq disabling in
the next patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/bio.c          | 74 ++++++++++++++++++++++++++++-------------
 fs/btrfs/disk-io.c      |  5 +++
 fs/btrfs/fs.h           |  1 +
 fs/btrfs/ordered-data.c | 17 +---------
 fs/btrfs/ordered-data.h |  2 --
 5 files changed, 57 insertions(+), 42 deletions(-)

Comments

Johannes Thumshirn March 16, 2023, 5:14 p.m. UTC | #1
On 14.03.23 17:59, Christoph Hellwig wrote:
> +	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
> +		if (btrfs_is_free_space_inode(bbio->inode))
> +			return fs_info->endio_freespace_worker;
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_write_meta_workers;
> +		return fs_info->endio_write_workers;
> +	} else {
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_meta_workers;
> +		return fs_info->endio_workers;
> +	}


Can't that else be dropped as well?
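
For reference, a minimal sketch of the helper with the else branch dropped, as suggested; the logic is copied unchanged from the quoted hunk and this is only an illustration of the suggested style, not the applied patch:

static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
{
        struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;

        if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
                if (btrfs_is_free_space_inode(bbio->inode))
                        return fs_info->endio_freespace_worker;
                if (bbio->bio.bi_opf & REQ_META)
                        return fs_info->endio_write_meta_workers;
                return fs_info->endio_write_workers;
        }
        if (bbio->bio.bi_opf & REQ_META)
                return fs_info->endio_meta_workers;
        return fs_info->endio_workers;
}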
Qu Wenruo March 20, 2023, 11:29 a.m. UTC | #2
On 2023/3/15 00:59, Christoph Hellwig wrote:
> Currently all reads are offloaded to workqueues for checksum
> verification.  Writes are initially processed in the bi_end_io
> context that usually happens in hard or soft interrupts, and are
> only offloaded to a workqueue for ordered extent completion or
> compressed extent handling, and in the future for raid stripe
> tree handling.
> 
> Switch to offloading all writes to workqueues instead of doing that
> only for the final ordered extent processing.  This means slightly
> more workqueue offloads for large writes, as each submitted btrfs_bio
> can only hold 1MiB worth of data on systems with a 4K page size.  On
> the other hand this will allow large simplifications by always having
> a process context.
> 
> For metadata writes there was no offload at all before, which creates
> various problems, such as not being able to drop the final
> extent_buffer reference from the end_io handler.
> 
> With the entire series applied this shows no performance difference
> for data-heavy workloads.  For metadata-heavy workloads like Filipe's
> fs_mark test there is also no change in the averages, while the
> outliers in terms of best and worst performance seem to be somewhat
> reduced.  This is probably due to the removal of the irq disabling in
> the next patches.

I'm uncertain what the proper or better solution is for handling the 
endio functions.

Sure, they are called in a very strict context, thus we should keep 
them short.
But on the other hand, we already have too many workqueues, and I'm 
always wondering under what situations they can lead to deadlocks.
(e.g. why do we need to queue endios for free space and regular data 
inodes into different workqueues?)

My current approach is to always assume the workqueue has only 1 
max_active, but I'm still not sure what would happen in that case if 
one work item slept.
Would the workqueue be able to choose the next work item? Or is the 
workqueue stalled until the only active item gets woken?


Thus I'm uncertain if we're going the correct way.

Personally speaking, I'd like to keep the btrfs bio endio function calls 
in the old soft/hard irq context, and let the higher layer queue the 
work.

However, we have already loosened the endio context for btrfs bios from 
the old soft/hard irq context to the current workqueue context...

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/bio.c          | 74 ++++++++++++++++++++++++++++-------------
>   fs/btrfs/disk-io.c      |  5 +++
>   fs/btrfs/fs.h           |  1 +
>   fs/btrfs/ordered-data.c | 17 +---------
>   fs/btrfs/ordered-data.h |  2 --
>   5 files changed, 57 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 85539964864a34..037eded3b33ba2 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -300,20 +300,33 @@ static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
>   {
>   	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
>   
> -	if (bbio->bio.bi_opf & REQ_META)
> -		return fs_info->endio_meta_workers;
> -	return fs_info->endio_workers;
> +	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
> +		if (btrfs_is_free_space_inode(bbio->inode))
> +			return fs_info->endio_freespace_worker;
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_write_meta_workers;
> +		return fs_info->endio_write_workers;
> +	} else {
> +		if (bbio->bio.bi_opf & REQ_META)
> +			return fs_info->endio_meta_workers;
> +		return fs_info->endio_workers;
> +	}
>   }
>   
> -static void btrfs_end_bio_work(struct work_struct *work)
> +static void btrfs_simple_end_bio_work(struct work_struct *work)
>   {
>   	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
> +	struct bio *bio = &bbio->bio;
> +	struct btrfs_device *dev = bio->bi_private;
>   
>   	/* Metadata reads are checked and repaired by the submitter. */
> -	if (bbio->bio.bi_opf & REQ_META)
> -		bbio->end_io(bbio);
> -	else
> -		btrfs_check_read_bio(bbio, bbio->bio.bi_private);
> +	if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) {
> +		btrfs_check_read_bio(bbio, dev);
> +	} else {
> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +			btrfs_record_physical_zoned(bbio);
> +		btrfs_orig_bbio_end_io(bbio);
> +	}
>   }
>   
>   static void btrfs_simple_end_io(struct bio *bio)
> @@ -322,18 +335,19 @@ static void btrfs_simple_end_io(struct bio *bio)
>   	struct btrfs_device *dev = bio->bi_private;
>   
>   	btrfs_bio_counter_dec(bbio->inode->root->fs_info);
> -
>   	if (bio->bi_status)
>   		btrfs_log_dev_io_error(bio, dev);
>   
> -	if (bio_op(bio) == REQ_OP_READ) {
> -		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
> -		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
> -	} else {
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> -			btrfs_record_physical_zoned(bbio);
> -		btrfs_orig_bbio_end_io(bbio);
> -	}

I can already see that in the future there will be extra if checks to 
skip the workqueue for scrub bios...

> +	INIT_WORK(&bbio->end_io_work, btrfs_simple_end_bio_work);
> +	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
> +}
> +
> +static void btrfs_write_end_bio_work(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio =
> +		container_of(work, struct btrfs_bio, end_io_work);
> +
> +	btrfs_orig_bbio_end_io(bbio);
>   }
>   
>   static void btrfs_raid56_end_io(struct bio *bio)
> @@ -343,12 +357,23 @@ static void btrfs_raid56_end_io(struct bio *bio)
>   
>   	btrfs_bio_counter_dec(bioc->fs_info);
>   	bbio->mirror_num = bioc->mirror_num;
> -	if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META))
> -		btrfs_check_read_bio(bbio, NULL);
> -	else
> -		btrfs_orig_bbio_end_io(bbio);
> -
>   	btrfs_put_bioc(bioc);
> +
> +	/*
> +	 * The RAID5/6 code already completes all bios from workqueues, but for
> +	 * writes we need to offload them again to avoid deadlocks in the ordered
> +	 * extent processing.
> +	 */

Mind sharing some details about the possible deadlock here?

To me, it's just a different workqueue doing the finish ordered io workload.

Thanks,
Qu

> +	if (bio_op(bio) == REQ_OP_READ) {
> +		/* Metadata reads are checked and repaired by the submitter. */
> +		if (!(bio->bi_opf & REQ_META))
> +			btrfs_check_read_bio(bbio, NULL);
> +		else
> +			btrfs_orig_bbio_end_io(bbio);
> +	} else {
> +		INIT_WORK(&btrfs_bio(bio)->end_io_work, btrfs_write_end_bio_work);
> +		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
> +	}
>   }
>   
>   static void btrfs_orig_write_end_io(struct bio *bio)
> @@ -372,9 +397,10 @@ static void btrfs_orig_write_end_io(struct bio *bio)
>   		bio->bi_status = BLK_STS_IOERR;
>   	else
>   		bio->bi_status = BLK_STS_OK;
> -
> -	btrfs_orig_bbio_end_io(bbio);
>   	btrfs_put_bioc(bioc);
> +
> +	INIT_WORK(&bbio->end_io_work, btrfs_write_end_bio_work);
> +	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
>   }
>   
>   static void btrfs_clone_write_end_io(struct bio *bio)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 0889eb81e71a7d..5b63a5161cedea 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2008,6 +2008,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>   	 * the queues used for metadata I/O, since tasks from those other work
>   	 * queues can do metadata I/O operations.
>   	 */
> +	if (fs_info->endio_write_meta_workers)
> +		destroy_workqueue(fs_info->endio_write_meta_workers);
>   	if (fs_info->endio_meta_workers)
>   		destroy_workqueue(fs_info->endio_meta_workers);
>   }
> @@ -2204,6 +2206,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   		alloc_workqueue("btrfs-endio", flags, max_active);
>   	fs_info->endio_meta_workers =
>   		alloc_workqueue("btrfs-endio-meta", flags, max_active);
> +	fs_info->endio_write_meta_workers =
> +		alloc_workqueue("btrfs-endio-write-meta", flags, max_active);
>   	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
>   	fs_info->endio_write_workers =
>   		alloc_workqueue("btrfs-endio-write", flags, max_active);
> @@ -2224,6 +2228,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   	      fs_info->endio_workers && fs_info->endio_meta_workers &&
>   	      fs_info->compressed_write_workers &&
>   	      fs_info->endio_write_workers &&
> +	      fs_info->endio_write_meta_workers &&
>   	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
>   	      fs_info->caching_workers && fs_info->fixup_workers &&
>   	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
> diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
> index 276a17780f2b1b..e2643bc5c039ad 100644
> --- a/fs/btrfs/fs.h
> +++ b/fs/btrfs/fs.h
> @@ -543,6 +543,7 @@ struct btrfs_fs_info {
>   	struct workqueue_struct *rmw_workers;
>   	struct workqueue_struct *compressed_write_workers;
>   	struct workqueue_struct *endio_write_workers;
> +	struct workqueue_struct *endio_write_meta_workers;
>   	struct workqueue_struct *endio_freespace_worker;
>   	struct btrfs_workqueue *caching_workers;
>   
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 23f496f0d7b776..48c7510df80a0d 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -303,14 +303,6 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>   	spin_unlock_irq(&tree->lock);
>   }
>   
> -static void finish_ordered_fn(struct work_struct *work)
> -{
> -	struct btrfs_ordered_extent *ordered_extent;
> -
> -	ordered_extent = container_of(work, struct btrfs_ordered_extent, work);
> -	btrfs_finish_ordered_io(ordered_extent);
> -}
> -
>   /*
>    * Mark all ordered extents io inside the specified range finished.
>    *
> @@ -330,17 +322,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   {
>   	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct workqueue_struct *wq;
>   	struct rb_node *node;
>   	struct btrfs_ordered_extent *entry = NULL;
>   	unsigned long flags;
>   	u64 cur = file_offset;
>   
> -	if (btrfs_is_free_space_inode(inode))
> -		wq = fs_info->endio_freespace_worker;
> -	else
> -		wq = fs_info->endio_write_workers;
> -
>   	if (page)
>   		ASSERT(page->mapping && page_offset(page) <= file_offset &&
>   		       file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
> @@ -439,8 +425,7 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   			refcount_inc(&entry->refs);
>   			trace_btrfs_ordered_extent_mark_finished(inode, entry);
>   			spin_unlock_irqrestore(&tree->lock, flags);
> -			INIT_WORK(&entry->work, finish_ordered_fn);
> -			queue_work(wq, &entry->work);
> +			btrfs_finish_ordered_io(entry);
>   			spin_lock_irqsave(&tree->lock, flags);
>   		}
>   		cur += len;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b8a92f040458f0..64a37d42e7a24a 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -146,8 +146,6 @@ struct btrfs_ordered_extent {
>   	/* a per root list of all the pending ordered extents */
>   	struct list_head root_extent_list;
>   
> -	struct work_struct work;
> -
>   	struct completion completion;
>   	struct btrfs_work flush_work;
>   	struct list_head work_list;
hch@lst.de March 20, 2023, 12:30 p.m. UTC | #3
On Mon, Mar 20, 2023 at 07:29:38PM +0800, Qu Wenruo wrote:
> Sure, they are called in very strict context, thus we should keep them 
> short.
> But on the other hand, we're already having too many workqueues, and I'm 
> always wondering under what situation they can lead to deadlock.
> (e.g. why we need to queue endios for free space and regular data inodes 
> into different workqueues?)

In general the reason for separate workqueues is if one workqueue depends
on the execution of another.  It seems like this is Josef's area, but
my impression is that finishing an ordered extent can cause writeback
and a possible wait for data in the freespace inode.  Normally such
workqueue splits should have comments in the code explaining them, but
so far I haven't found one.
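
To make that kind of dependency concrete, here is a hypothetical, self-contained sketch (all names are made up, this is not btrfs code): work item A waits on work item B, so on a queue limited to a single active work item A would occupy the only execution slot B needs.

#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *wq_a;	/* e.g. ordered extent completion */
static struct workqueue_struct *wq_b;	/* e.g. freespace inode endio */
static struct work_struct work_a, work_b;
static DECLARE_COMPLETION(b_done);

static void work_b_fn(struct work_struct *work)
{
        /* stands in for the free space inode write completion */
        complete(&b_done);
}

static void work_a_fn(struct work_struct *work)
{
        /* stands in for finishing an ordered extent that rewrites the
         * free space cache and then waits for that write to finish */
        queue_work(wq_b, &work_b);
        /*
         * If work_b were queued on wq_a and wq_a were limited to one
         * active work item, this wait could never be satisfied: work_a
         * holds the only execution slot work_b would need.
         */
        wait_for_completion(&b_done);
}

Running the two items on separate queues (or on one queue with enough concurrency) breaks the cycle, which is presumably why the freespace completions have their own workqueue.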

> My current method is always consider the workqueue has only 1 max_active, 
> but I'm still not sure for such case, what would happen if one work slept?

That's my understanding of the workqueue mechanisms, yes.

> Would the workqueue being able to choose the next work item? Or that 
> workqueue is stalled until the only active got woken?

I think it is stalled.  That's why the workqueue documentation heavily
discourages limiting max_active unless you have a good reason to, and
most callers follow that advice.

> Personally speaking, I'd like to keep the btrfs bio endio function calls in 
> the old soft/hard irq context, and let the higher layer to queue the work.

Can you explain why?

> However we have already loosen the endio context for btrfs bio, from the 
> old soft/hard irq to the current workqueue context already...

Read I/O completions are executed in a workqueue.  For write completions
there are also various offloads, but none that are consistent and
dependable so far.
Qu Wenruo March 20, 2023, 11:37 p.m. UTC | #4
On 2023/3/20 20:30, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 07:29:38PM +0800, Qu Wenruo wrote:
>> Sure, they are called in very strict context, thus we should keep them
>> short.
>> But on the other hand, we're already having too many workqueues, and I'm
>> always wondering under what situation they can lead to deadlock.
>> (e.g. why we need to queue endios for free space and regular data inodes
>> into different workqueues?)
> 
> In general the reason for separate workqueues is if one workqueue depends
> on the execution of another.  It seems like this is Josef's area, but
> my impression is that finishing and ordered extent can cause writeback
> and a possible wait for data in the freespace inode.  Normally such
> workqueue splits should have comments in the code to explain them, but
> so far I haven't found one.
> 
>> My current method is always consider the workqueue has only 1 max_active,
>> but I'm still not sure for such case, what would happen if one work slept?
> 
> That's my understanding of the workqueue mechanisms, yes.
> 
>> Would the workqueue being able to choose the next work item? Or that
>> workqueue is stalled until the only active got woken?
> 
> I think it is stalled.  That's why the workqueue heavily discourages
> limiting max_active unless you have a good reason to, and most callers
> follow that advise.

To me, this looks more like hiding a bug in the workqueue user.
Shouldn't we expose such bugs instead of hiding them?

Furthermore, although I'm a big fan of plain workqueues, the old btrfs 
workqueues allowed us to apply max_active to the plain workqueues, but 
now we don't have this ability any more.

Thus at the very least we should bring back the old behavior, and 
possibly fix whatever bugs exist in our workqueue usage.

> 
>> Personally speaking, I'd like to keep the btrfs bio endio function calls in
>> the old soft/hard irq context, and let the higher layer to queue the work.
> 
> Can you explain why?

Just to keep the context consistent.

Imagine a situation where we put some sleeping functions into an endio 
function without any doubt, and fully rely on the fact that the endio 
function is executed in workqueue context.

Otherwise we have to add extra comments explaining the situation every 
time we do something like this.

Thanks,
Qu
> 
>> However we have already loosen the endio context for btrfs bio, from the
>> old soft/hard irq to the current workqueue context already...
> 
> read I/O are executed in a workqueue.  For write completions there also
> are various offloads, but none that is consistent and dependable so far.
hch@lst.de March 21, 2023, 12:55 p.m. UTC | #5
On Tue, Mar 21, 2023 at 07:37:46AM +0800, Qu Wenruo wrote:
>> I think it is stalled.  That's why the workqueue heavily discourages
>> limiting max_active unless you have a good reason to, and most callers
>> follow that advise.
>
> To me, this looks more like hiding a bug in the workqueue user.
> Shouldn't we expose such bugs instead of hiding them?

If you limit max_active to a certain value, you clearly tell the
workqueue code not to use more workers than that.  That is what the
argument is for.

I don't see where there is a bug, or that someone is hiding it.

> Furthermore although I'm a big fan of plain workqueue, the old btrfs 
> workqueues allows us to apply max_active to the plain workqueues, but now 
> we don't have this ability any more.

You can pass max_active to plain Linux workqueues and there is a bunch
of places that do that, although the vast majority passes either 0 to
use the default, or 1 to limit themselves to a single active worker.

The core also allows changing it, but nothing but the btrfs_workqueue
code ever does.  We can wire up btrfs to change the concurrency if we
have a good reason to do so.  And I'd like to figure out if there is one,
and if so, for which workqueue; but for that we'd need an argument for
why a given workqueue would want to use a larger or smaller than default
value.  The old argument of higher resource usage with a bigger number
of workers does not apply any more with the concurrency managed
workqueue implementation.
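
For reference, a hedged sketch of what such wiring could look like; workqueue_set_max_active() is the existing core API and thread_pool_size and endio_write_workers are existing btrfs_fs_info fields, but which queues (if any) should get the limit is exactly the open question here:

#include <linux/workqueue.h>

/* Sketch only, not part of this series. */
static void btrfs_apply_thread_pool_limit(struct btrfs_fs_info *fs_info)
{
        /*
         * Honour the existing thread_pool= mount option on a plain
         * workqueue; could be called at mount time and again on remount.
         */
        if (fs_info->thread_pool_size)
                workqueue_set_max_active(fs_info->endio_write_workers,
                                         fs_info->thread_pool_size);
}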

>>> Personally speaking, I'd like to keep the btrfs bio endio function calls in
>>> the old soft/hard irq context, and let the higher layer to queue the work.
>>
>> Can you explain why?
>
> Just to keep the context consistent.

Which is what this series does.  Before, all read I/O completion handlers
were called in workqueue context, while write ones weren't.  With the
series, write completion handlers are called from workqueue context
as well, and with that all significant btrfs code except for tiny bits
of bi_end_io handlers is called from process context.

> Image a situation, where we put some sleep functions into a endio function 
> without any doubt, and fully rely on the fact the endio function is 
> executed under workqueue context.

Being able to do that is the point of this series.
Qu Wenruo March 21, 2023, 11:37 p.m. UTC | #6
On 2023/3/21 20:55, Christoph Hellwig wrote:
> On Tue, Mar 21, 2023 at 07:37:46AM +0800, Qu Wenruo wrote:
>>> I think it is stalled.  That's why the workqueue heavily discourages
>>> limiting max_active unless you have a good reason to, and most callers
>>> follow that advise.
>>
>> To me, this looks more like hiding a bug in the workqueue user.
>> Shouldn't we expose such bugs instead of hiding them?
> 
> If you limit max_active to a certain value, you clearly tell the
> workqueue code not not use more workers that that.  That is what the
> argument is for.

And if a workload can only be deadlock-free using the default 
max_active, but not any smaller value, then I'd say the workload itself 
is buggy.

> 
> I don't see where there is a bug, and that someone is hiding it.
> 
>> Furthermore although I'm a big fan of plain workqueue, the old btrfs
>> workqueues allows us to apply max_active to the plain workqueues, but now
>> we don't have this ability any more.
> 
> You can pass max_active to plain Linux workqueues and there is a bunch
> of places that do that, although the vast majority passes either 0 to
> use the default, or 1 to limit themselves to a single active worker.
> 
> The core also allows to change it, but nothing but the btrfs_workqueue
> code ever does.  We can wire up btrfs to change the conccurency if we
> have a good reason to do.  And I'd like to figure out if there is one,
> and if yes for which workqueue, but for that we'd need to have an
> argument for why which workqueue would like to use a larger or smaller
> than default value.  The old argument of higher resource usage with
> a bigger number of workers does not apply any more with the concurrency
> managed workqueue implementation.

The use case is still there: to limit the amount of CPU time spent by 
btrfs workloads, from csum verification to compression.

And if limiting max_active for plain workqueues could help us expose 
possible deadlocks, that would be an extra benefit.

Thanks,
Qu

> 
>>>> Personally speaking, I'd like to keep the btrfs bio endio function calls in
>>>> the old soft/hard irq context, and let the higher layer to queue the work.
>>>
>>> Can you explain why?
>>
>> Just to keep the context consistent.
> 
> Which is what this series does.  Before all read I/O completion handlers
> were called in workqueue context, while write ones weren't.  With the
> series write completion handlers are called from workqueue context
> as well, and with that all significant btrfs code except for tiny bits
> of bi_end_io handlers are called from process context.
> 
>> Image a situation, where we put some sleep functions into a endio function
>> without any doubt, and fully rely on the fact the endio function is
>> executed under workqueue context.
> 
> Being able to do that is the point of this series.
hch@lst.de March 22, 2023, 8:32 a.m. UTC | #7
On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>> If you limit max_active to a certain value, you clearly tell the
>> workqueue code not not use more workers that that.  That is what the
>> argument is for.
>
> And if a work load can only be deadlock free using the default max_active, 
> but not any value smaller, then I'd say the work load itself is buggy.

Anything that has an interaction between two instances of a work_struct
can deadlock.  Only a single execution context is guaranteed (and even
that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
to e.g. only using a global workqueue in file systems or block devices
that can stack.

Fortunately these days lockdep is generally able to catch these
dependencies as well.

> The usecase is still there.
> To limit the amount of CPU time spent by btrfs workloads, from csum 
> verification to compression.

So this is the first time I see an actual explanation, thanks for that.
If this is the reason, we should apply the max_active limit to all
workqueues that do csum and compression work, but not to other random
workqueues.

Dave, Josef, Chris: do you agree with this interpretation?
Qu Wenruo March 23, 2023, 8:07 a.m. UTC | #8
On 2023/3/22 16:32, Christoph Hellwig wrote:
> On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>>> If you limit max_active to a certain value, you clearly tell the
>>> workqueue code not not use more workers that that.  That is what the
>>> argument is for.
>>
>> And if a work load can only be deadlock free using the default max_active,
>> but not any value smaller, then I'd say the work load itself is buggy.
> 
> Anything that has an interaction between two instances of a work_struct
> can deadlock.  Only a single execution context is guaranteed (and even
> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
> to e.g. only using a global workqueue in file systems or block devices
> that can stack.

Shouldn't we avoid such cross-workqueue workloads at all costs?

IIRC we don't have such usage in btrfs, at least.
And I hope we never will.

> 
> Fortunately these days lockdep is generally able to catch these
> dependencies as well.
> 
>> The usecase is still there.
>> To limit the amount of CPU time spent by btrfs workloads, from csum
>> verification to compression.
> 
> So this is the first time I see an actual explanation, thanks for that
> first.  If this is the reason we should apply the max_active to all
> workqueus that do csum an compression work, but not to other random
> workqueues.

If we're limiting max_active for certain workqueues, then I'd say why 
not do it for all workqueues?

If we have some usage relying on the number of workers, at least we 
should be able to expose it and fix it.
(IIRC we should have a better way with less cross-workqueue dependency.)

Thanks,
Qu

> 
> Dave, Josef, Chis: do you agree to this interpretation?
hch@lst.de March 23, 2023, 8:12 a.m. UTC | #9
On Thu, Mar 23, 2023 at 04:07:28PM +0800, Qu Wenruo wrote:
>>> And if a work load can only be deadlock free using the default max_active,
>>> but not any value smaller, then I'd say the work load itself is buggy.
>>
>> Anything that has an interaction between two instances of a work_struct
>> can deadlock.  Only a single execution context is guaranteed (and even
>> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
>> to e.g. only using a global workqueue in file systems or block devices
>> that can stack.
>
> Shouldn't we avoid such cross workqueue workload at all cost?

Yes, btrfs uses per-sb workqueues.  As do most other places now,
but there was a bit of a learning curve years ago.

>> So this is the first time I see an actual explanation, thanks for that
>> first.  If this is the reason we should apply the max_active to all
>> workqueus that do csum an compression work, but not to other random
>> workqueues.
>
> If we're limiting the max_active for certain workqueues, then I'd say why 
> not to all workqueues?
>
> If we have some usage relying on the amount of workers, at least we should 
> be able to expose it and fix it.

Again, btrfs is the odd one out in allowing the user to set arbitrary
limits.  This code predates using the kernel workqueues, and I'm a little
doubtful it is still useful.  But for that I need to figure out why
it was even kept when converting btrfs to use workqueues.

> (IIRC we should have a better way with less cross-workqueue dependency)

I've been very actively working on reducing the number of different
workqueues.  This series is an important part of that.
Qu Wenruo March 23, 2023, 8:20 a.m. UTC | #10
On 2023/3/23 16:12, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 04:07:28PM +0800, Qu Wenruo wrote:
>>>> And if a work load can only be deadlock free using the default max_active,
>>>> but not any value smaller, then I'd say the work load itself is buggy.
>>>
>>> Anything that has an interaction between two instances of a work_struct
>>> can deadlock.  Only a single execution context is guaranteed (and even
>>> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
>>> to e.g. only using a global workqueue in file systems or block devices
>>> that can stack.
>>
>> Shouldn't we avoid such cross workqueue workload at all cost?
> 
> Yes, btrfs uses per-sb workqueues.  As do most other places now,
> but there as a bit of a learning curve years ago.
> 
>>> So this is the first time I see an actual explanation, thanks for that
>>> first.  If this is the reason we should apply the max_active to all
>>> workqueus that do csum an compression work, but not to other random
>>> workqueues.
>>
>> If we're limiting the max_active for certain workqueues, then I'd say why
>> not to all workqueues?
>>
>> If we have some usage relying on the amount of workers, at least we should
>> be able to expose it and fix it.
> 
> Again, btrfs is the odd one out allowing the user to set arbitrary limits.
> This code predates using the kernel workqueues, and I'm a little
> doubtful it still is useful.  But for that I need to figure out why
> it was even be kept when converting btrfs to use workqueues.
> 
>> (IIRC we should have a better way with less cross-workqueue dependency)
> 
> I've been very actively working on reducing the amount of different
> workqueues.  This series is an important part of that.
> 
Yeah, your work is very much appreciated, without any doubt.

I'm just wondering: if we all know that we should avoid cross-workqueue 
dependencies to avoid deadlocks, then why is no one trying to expose 
such deadlocks by simply limiting max_active to 1 for all workqueues?

Especially with lockdep, once we get a reproducer it should not be that 
hard to fix.

That's the only other valid use case besides limiting CPU usage for 
csum/compression.  I know regular users should not touch this, but we 
developers should still be able to create a worst case for our CI 
systems to suffer through.

Thanks,
Qu
Chris Mason March 23, 2023, 2:53 p.m. UTC | #11
On 3/22/23 4:32 AM, Christoph Hellwig wrote:
> On Wed, Mar 22, 2023 at 07:37:07AM +0800, Qu Wenruo wrote:
>>> If you limit max_active to a certain value, you clearly tell the
>>> workqueue code not not use more workers that that.  That is what the
>>> argument is for.
>>
>> And if a work load can only be deadlock free using the default max_active, 
>> but not any value smaller, then I'd say the work load itself is buggy.
> 
> Anything that has an interaction between two instances of a work_struct
> can deadlock.  Only a single execution context is guaranteed (and even
> that only with WQ_MEM_RECLAIM), and we've seen plenty of deadlocks due
> to e.g. only using a global workqueue in file systems or block devices
> that can stack.
> 
> Fortunately these days lockdep is generally able to catch these
> dependencies as well.
> 
>> The usecase is still there.
>> To limit the amount of CPU time spent by btrfs workloads, from csum 
>> verification to compression.
> 
> So this is the first time I see an actual explanation, thanks for that
> first.  If this is the reason we should apply the max_active to all
> workqueus that do csum an compression work, but not to other random
> workqueues.
> 
> Dave, Josef, Chis: do you agree to this interpretation?

The original reason for limiting the number of workers was that work
like crc calculations was causing tons of context switches.  This isn't
a surprise: there were a ton of work items, and each one was processed
very quickly.

So the original btrfs thread pools had optimizations meant to limit
context switches by preferring to give work to a smaller number of workers.

For compression it's more clear cut.  I wanted the ability to saturate
all the CPUs with compression work, but also wanted a default that
didn't take over the whole machine.

-chris
hch@lst.de March 24, 2023, 1:09 a.m. UTC | #12
On Thu, Mar 23, 2023 at 10:53:16AM -0400, Chris Mason wrote:
> The original reason for limiting the number of workers was that work
> like crc calculations was causing tons of context switches.  This isn't
> a surprise, there were a ton of work items, and each one was processed
> very quickly.

Yeah.  Although quite a bit of that went away since, by not doing
the offload for sync I/O, nor for metadata when there is a fast crc32
implementation (we need to do the same for small data I/O and hardware
accelerated sha256, btw), and by doing these in larger chunks, but ..

> 
> So the original btrfs thread pools had optimizations meant to limit
> context switches by preferring to give work to a smaller number of workers.

.. this is something that the workqueue code already does under the hood
anyway.

> For compression it's more clear cut.  I wanted the ability to saturate
> all the CPUs with compression work, but also wanted a default that
> didn't take over the whole machine.

And that's actually a very good use case.

It almost asks for a separate option just for compression, or at least
for compression and checksumming only.

Is there consensus that for now we should limit thread_pool to the
workqueues that do compression and checksumming?  If so, I'll send a
series to wire it up again for the read completion workqueue that does
the checksum verification and for the compressed write workqueue, but
drop it for all the others except the write submission one.
hch@lst.de March 24, 2023, 1:11 a.m. UTC | #13
On Thu, Mar 23, 2023 at 04:20:45PM +0800, Qu Wenruo wrote:
> I'm just wondering if we all know that we should avoid cross-workqueue 
> dependency to avoid deadlock, then why no one is trying to expose any such 
> deadlocks by simply limiting max_active to 1 for all workqueues?

Cross workqueue dependencies are not a problem per se, and a lot of
places in the kernel rely on them.  Dependencies on another work_struct
on the same workqueue on the other hand are a problem.

> Especially with lockdep, once we got a reproduce, it should not be that 
> hard to fix.

lockdep helps you to find them without requiring to limit the max_active.
Chris Mason March 24, 2023, 1:25 p.m. UTC | #14
On 3/23/23 9:09 PM, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 10:53:16AM -0400, Chris Mason wrote:
>> The original reason for limiting the number of workers was that work
>> like crc calculations was causing tons of context switches.  This isn't
>> a surprise, there were a ton of work items, and each one was processed
>> very quickly.
> 
> Yeah.  Although quite a bit of that went away since, by not doing
> the offload for sync I/O, not for metadata when there is a fast crc32
> implementation (we need to do the same for small data I/O and hardware
> accelerated sha256, btw), and by doing these in larger chunks, but ..
> 
>>
>> So the original btrfs thread pools had optimizations meant to limit
>> context switches by preferring to give work to a smaller number of workers.
> 
> .. this is something that the workqueue code already does under the hood
> anyway.

Yeah, between hardware changes and kernel changes, these motivations
don't really apply anymore.  Even if they did, we're probably better off
fixing that in the workqueue code directly now.

> 
>> For compression it's more clear cut.  I wanted the ability to saturate
>> all the CPUs with compression work, but also wanted a default that
>> didn't take over the whole machine.
> 
> And that's actually a very good use case.
> 
> It almost asks for a separate option just for compression, or at least
> for compression and checksumming only.
> 
> Is there consensus that for now we should limit thread_pool for 
> se workqueues that do compression and chekcsumming for now?  Then
> I'll send a series to wire it up for the read completion workqueue
> again that does the checksum verification, the compressed write
> workqueue, but drop it for all others but the write submission
> one?

As you mentioned above, we're currently doing synchronous crcs for
metadata when BTRFS_FS_CSUM_IMPL_FAST is set, and for synchronous writes.
We've benchmarked this a few times and I think with modern hardware a
better default is just always doing synchronous crcs for data too.
better default is just always doing synchronous crcs for data too.

Qu or David, have you looked at synchronous data crcs recently?

Looking ahead, encryption is going to want similar knobs to compression.

My preference would be:

- crcs are always inline if your hardware is fast
- Compression, encryption, slow hardware crcs use the thread_pool_size knob
- We don't try to limit the other workers

The idea behind the knob is just "how much of your CPU should each btrfs
mount consume?"  Obviously we'll silently judge anyone that chooses less
than 100% but I guess it's good to give people options.

-chris
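
A hedged sketch of what that preference could look like at workqueue allocation time; the fields mirror the existing btrfs_fs_info members, while the function name, queue names, and error handling are illustrative only:

#include <linux/workqueue.h>
#include <linux/errno.h>

static int btrfs_sketch_init_workqueues(struct btrfs_fs_info *fs_info)
{
        unsigned int flags = WQ_UNBOUND | WQ_MEM_RECLAIM;
        unsigned int pool = fs_info->thread_pool_size;	/* the user knob */

        /* CPU-heavy work (compression, later encryption, slow crcs)
         * gets the user-visible limit ... */
        fs_info->compressed_write_workers =
                alloc_workqueue("btrfs-compressed-write", flags, pool);

        /* ... while ordinary completion queues use the workqueue
         * default concurrency (0). */
        fs_info->endio_write_workers =
                alloc_workqueue("btrfs-endio-write", flags, 0);

        if (!fs_info->compressed_write_workers ||
            !fs_info->endio_write_workers)
                return -ENOMEM;
        return 0;
}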
Chris Mason March 24, 2023, 7:20 p.m. UTC | #15
On 3/24/23 9:25 AM, Chris Mason wrote:
> On 3/23/23 9:09 PM, Christoph Hellwig wrote:

> As you mentioned above, we're currently doing synchronous crcs for
> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
> We've benchmarked this a few times and I think with modern hardware a
> better default is just always doing synchronous crcs for data too.
> 

I asked Jens to run some numbers on his fast drives (thanks Jens!).  He
did 1M buffered writes w/fio.

Unpatched Linus:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

Jens said there was a lot of variation during the run, going down as low
as 1100MB/s.

Synchronous crcs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

The synchronous run had about the same peak write speed but much lower
dips, and fewer kworkers churning around.

Both tests had fio saturated at 100% CPU.

I think we should flip crcs to synchronous as long as the implementation is HW accelerated.

-chris
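
A hedged sketch of the decision being proposed, built around the existing BTRFS_FS_CSUM_IMPL_FAST flag that btrfs sets at mount time when an accelerated csum implementation is detected; the helper name is hypothetical and this is not the actual change:

static bool btrfs_csum_data_inline(struct btrfs_fs_info *fs_info)
{
        /*
         * Compute data checksums in the submitting context when a
         * hardware-accelerated implementation is available, and only
         * keep the workqueue offload for slow, generic implementations.
         */
        return test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);
}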
hch@lst.de March 25, 2023, 8:13 a.m. UTC | #16
On Fri, Mar 24, 2023 at 03:20:47PM -0400, Chris Mason wrote:
> I asked Jens to run some numbers on his fast drives (thanks Jens!).  He
> did 1M buffered writes w/fio.
> 
> Unpatched Linus:
> 
> write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
> 
> Jens said there was a lot of variation during the run going down as low
> as 1100MB/s.
> 
> Synchronous crcs:
> 
> write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
> 
> The synchronous run had about the same peak write speed but much lower
> dips, and fewer kworkers churning around.

FYI, I did some similar runs on this just with kvm on consumer drives,
and the results were similar.  Interestingly enough I first accidentally
ran with the non-accelerated crc32 code, as kvm by default doesn't
pass through cpu flags.  Even with that the workqueue offload only
looked better for really large writes, and then only marginally.
hch@lst.de March 25, 2023, 8:15 a.m. UTC | #17
On Fri, Mar 24, 2023 at 09:25:07AM -0400, Chris Mason wrote:
> As you mentioned above, we're currently doing synchronous crcs for
> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
> We've benchmarked this a few times and I think with modern hardware a
> better default is just always doing synchronous crcs for data too.
> 
> Qu or David have you looked synchronous data crcs recently?

As mentioned in the other mail, I have looked at this a bit.  But only
for crc32 so far, and only on x86, so the matrix might be a little more
complicated.

> My preference would be:
> 
> - crcs are always inline if your hardware is fast
> - Compression, encryption, slow hardware crcs use the thread_pool_size knob
> - We don't try to limit the other workers
> 
> The idea behind the knob is just "how much of your CPU should each btrfs
> mount consume?"  Obviously we'll silently judge anyone that chooses less
> than 100% but I guess it's good to give people options.

Ok.  I'll do a series about the knobs ASAP, and then the inline CRCs
next.
Qu Wenruo March 25, 2023, 8:42 a.m. UTC | #18
On 2023/3/25 16:15, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 09:25:07AM -0400, Chris Mason wrote:
>> As you mentioned above, we're currently doing synchronous crcs for
>> metadata when BTRFS_FS_CSUM_IMPL_FAST, and for synchronous writes.
>> We've benchmarked this a few times and I think with modern hardware a
>> better default is just always doing synchronous crcs for data too.
>>
>> Qu or David have you looked synchronous data crcs recently?
> 
> As mentioend in the other mail I have a bit.  But only for crc32
> so far, and only on x86, so the matrix might be a little more
> complicated.

Haven't really looked into the async csum part at all.

In fact I'm not even sure what the point of generating data checksums 
asynchronously is...

I didn't see any extra splitting to get multiple threads calculating the 
hash on different parts.
Furthermore, I'm not even sure whether all the supported hash algos 
(CRC32/SHA256/BLAKE2/XXHASH) can support such split-and-merge 
multi-threading.
(IIRC SHA256 cannot?)

Thus I'm not really sure of the benefit of async csum generation anyway.
(Not to mention that asynchronous code itself is not straightforward.)

Considering we already used to do data csum verification synchronously 
in the endio function, I see no problem doing the generation path 
synchronously as well, even without hardware-accelerated checksum support.

Thanks,
Qu

> 
>> My preference would be:
>>
>> - crcs are always inline if your hardware is fast
>> - Compression, encryption, slow hardware crcs use the thread_pool_size knob
>> - We don't try to limit the other workers
>>
>> The idea behind the knob is just "how much of your CPU should each btrfs
>> mount consume?"  Obviously we'll silently judge anyone that chooses less
>> than 100% but I guess it's good to give people options.
> 
> Ok.  I'll do a series about the nobs ASAP, and then the inline CRCs
> next.
Chris Mason March 25, 2023, 5:16 p.m. UTC | #19
On 3/25/23 4:13 AM, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 03:20:47PM -0400, Chris Mason wrote:
>> I asked Jens to run some numbers on his fast drives (thanks Jens!).  He
>> did 1M buffered writes w/fio.
>>
>> Unpatched Linus:
>>
>> write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
>>
>> Jens said there was a lot of variation during the run going down as low
>> as 1100MB/s.
>>
>> Synchronous crcs:
>>
>> write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
>>
>> The synchronous run had about the same peak write speed but much lower
>> dips, and fewer kworkers churning around.
> 
> FYI, I did some similar runs on this just with kvm on consumer drives,
> and the results were similar.  Interestingly enough I first accidentally
> ran with the non-accelerated crc32 code, as kvm by default doesn't
> pass through cpu flags.  Even with that the workqueue offload only
> looked better for really large writes, and then only marginally.

There is pretty clearly an opportunity to tune the crc workers, but I
wouldn't make that a requirement for your series.  It'll be a bigger win
to just force inline most of the time.

-chris

Patch

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 85539964864a34..037eded3b33ba2 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -300,20 +300,33 @@  static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_bio *bbio)
 {
 	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 
-	if (bbio->bio.bi_opf & REQ_META)
-		return fs_info->endio_meta_workers;
-	return fs_info->endio_workers;
+	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE) {
+		if (btrfs_is_free_space_inode(bbio->inode))
+			return fs_info->endio_freespace_worker;
+		if (bbio->bio.bi_opf & REQ_META)
+			return fs_info->endio_write_meta_workers;
+		return fs_info->endio_write_workers;
+	} else {
+		if (bbio->bio.bi_opf & REQ_META)
+			return fs_info->endio_meta_workers;
+		return fs_info->endio_workers;
+	}
 }
 
-static void btrfs_end_bio_work(struct work_struct *work)
+static void btrfs_simple_end_bio_work(struct work_struct *work)
 {
 	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+	struct bio *bio = &bbio->bio;
+	struct btrfs_device *dev = bio->bi_private;
 
 	/* Metadata reads are checked and repaired by the submitter. */
-	if (bbio->bio.bi_opf & REQ_META)
-		bbio->end_io(bbio);
-	else
-		btrfs_check_read_bio(bbio, bbio->bio.bi_private);
+	if (bio_op(bio) == REQ_OP_READ && !(bio->bi_opf & REQ_META)) {
+		btrfs_check_read_bio(bbio, dev);
+	} else {
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+			btrfs_record_physical_zoned(bbio);
+		btrfs_orig_bbio_end_io(bbio);
+	}
 }
 
 static void btrfs_simple_end_io(struct bio *bio)
@@ -322,18 +335,19 @@  static void btrfs_simple_end_io(struct bio *bio)
 	struct btrfs_device *dev = bio->bi_private;
 
 	btrfs_bio_counter_dec(bbio->inode->root->fs_info);
-
 	if (bio->bi_status)
 		btrfs_log_dev_io_error(bio, dev);
 
-	if (bio_op(bio) == REQ_OP_READ) {
-		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
-		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
-	} else {
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-			btrfs_record_physical_zoned(bbio);
-		btrfs_orig_bbio_end_io(bbio);
-	}
+	INIT_WORK(&bbio->end_io_work, btrfs_simple_end_bio_work);
+	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
+}
+
+static void btrfs_write_end_bio_work(struct work_struct *work)
+{
+	struct btrfs_bio *bbio =
+		container_of(work, struct btrfs_bio, end_io_work);
+
+	btrfs_orig_bbio_end_io(bbio);
 }
 
 static void btrfs_raid56_end_io(struct bio *bio)
@@ -343,12 +357,23 @@  static void btrfs_raid56_end_io(struct bio *bio)
 
 	btrfs_bio_counter_dec(bioc->fs_info);
 	bbio->mirror_num = bioc->mirror_num;
-	if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META))
-		btrfs_check_read_bio(bbio, NULL);
-	else
-		btrfs_orig_bbio_end_io(bbio);
-
 	btrfs_put_bioc(bioc);
+
+	/*
+	 * The RAID5/6 code already completes all bios from workqueues, but for
+	 * writes we need to offload them again to avoid deadlocks in the ordered
+	 * extent processing.
+	 */
+	if (bio_op(bio) == REQ_OP_READ) {
+		/* Metadata reads are checked and repaired by the submitter. */
+		if (!(bio->bi_opf & REQ_META))
+			btrfs_check_read_bio(bbio, NULL);
+		else
+			btrfs_orig_bbio_end_io(bbio);
+	} else {
+		INIT_WORK(&btrfs_bio(bio)->end_io_work, btrfs_write_end_bio_work);
+		queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
+	}
 }
 
 static void btrfs_orig_write_end_io(struct bio *bio)
@@ -372,9 +397,10 @@  static void btrfs_orig_write_end_io(struct bio *bio)
 		bio->bi_status = BLK_STS_IOERR;
 	else
 		bio->bi_status = BLK_STS_OK;
-
-	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
+
+	INIT_WORK(&bbio->end_io_work, btrfs_write_end_bio_work);
+	queue_work(btrfs_end_io_wq(bbio), &bbio->end_io_work);
 }
 
 static void btrfs_clone_write_end_io(struct bio *bio)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0889eb81e71a7d..5b63a5161cedea 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2008,6 +2008,8 @@  static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	 * the queues used for metadata I/O, since tasks from those other work
 	 * queues can do metadata I/O operations.
 	 */
+	if (fs_info->endio_write_meta_workers)
+		destroy_workqueue(fs_info->endio_write_meta_workers);
 	if (fs_info->endio_meta_workers)
 		destroy_workqueue(fs_info->endio_meta_workers);
 }
@@ -2204,6 +2206,8 @@  static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 		alloc_workqueue("btrfs-endio", flags, max_active);
 	fs_info->endio_meta_workers =
 		alloc_workqueue("btrfs-endio-meta", flags, max_active);
+	fs_info->endio_write_meta_workers =
+		alloc_workqueue("btrfs-endio-write-meta", flags, max_active);
 	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
 	fs_info->endio_write_workers =
 		alloc_workqueue("btrfs-endio-write", flags, max_active);
@@ -2224,6 +2228,7 @@  static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->compressed_write_workers &&
 	      fs_info->endio_write_workers &&
+	      fs_info->endio_write_meta_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->fixup_workers &&
 	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 276a17780f2b1b..e2643bc5c039ad 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -543,6 +543,7 @@  struct btrfs_fs_info {
 	struct workqueue_struct *rmw_workers;
 	struct workqueue_struct *compressed_write_workers;
 	struct workqueue_struct *endio_write_workers;
+	struct workqueue_struct *endio_write_meta_workers;
 	struct workqueue_struct *endio_freespace_worker;
 	struct btrfs_workqueue *caching_workers;
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 23f496f0d7b776..48c7510df80a0d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -303,14 +303,6 @@  void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 	spin_unlock_irq(&tree->lock);
 }
 
-static void finish_ordered_fn(struct work_struct *work)
-{
-	struct btrfs_ordered_extent *ordered_extent;
-
-	ordered_extent = container_of(work, struct btrfs_ordered_extent, work);
-	btrfs_finish_ordered_io(ordered_extent);
-}
-
 /*
  * Mark all ordered extents io inside the specified range finished.
  *
@@ -330,17 +322,11 @@  void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 {
 	struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct workqueue_struct *wq;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
 	unsigned long flags;
 	u64 cur = file_offset;
 
-	if (btrfs_is_free_space_inode(inode))
-		wq = fs_info->endio_freespace_worker;
-	else
-		wq = fs_info->endio_write_workers;
-
 	if (page)
 		ASSERT(page->mapping && page_offset(page) <= file_offset &&
 		       file_offset + num_bytes <= page_offset(page) + PAGE_SIZE);
@@ -439,8 +425,7 @@  void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 			refcount_inc(&entry->refs);
 			trace_btrfs_ordered_extent_mark_finished(inode, entry);
 			spin_unlock_irqrestore(&tree->lock, flags);
-			INIT_WORK(&entry->work, finish_ordered_fn);
-			queue_work(wq, &entry->work);
+			btrfs_finish_ordered_io(entry);
 			spin_lock_irqsave(&tree->lock, flags);
 		}
 		cur += len;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index b8a92f040458f0..64a37d42e7a24a 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -146,8 +146,6 @@  struct btrfs_ordered_extent {
 	/* a per root list of all the pending ordered extents */
 	struct list_head root_extent_list;
 
-	struct work_struct work;
-
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;