[08/13] btrfs: Use alloc_ordered_workqueue() to create ordered workqueues

Message ID 20230509015032.3768622-9-tj@kernel.org (mailing list archive)
State New, archived

Commit Message

Tejun Heo May 9, 2023, 1:50 a.m. UTC
BACKGROUND
==========

When multiple work items are queued to a workqueue, their execution order
doesn't match the queueing order. They may get executed in any order and
simultaneously. When fully serialized execution - one by one in the queueing
order - is needed, an ordered workqueue should be used which can be created
with alloc_ordered_workqueue().
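
As an illustrative sketch of the two allocation styles (the workqueue names here are placeholders, not ones used by btrfs):

```c
#include <linux/workqueue.h>

/* Unbound workqueue: queued items may execute concurrently and
 * complete in any order. The second argument is flags, the third
 * max_active (0 selects the default). */
struct workqueue_struct *wq =
	alloc_workqueue("example_wq", WQ_UNBOUND, 0);

/* Ordered workqueue: at most one item executes at a time, strictly
 * in queueing order. Note there is no max_active argument. */
struct workqueue_struct *ordered_wq =
	alloc_ordered_workqueue("example_ordered_wq", 0);
```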

However, alloc_ordered_workqueue() was a later addition. Before it, an
ordered workqueue could be obtained by creating an UNBOUND workqueue with
@max_active==1. This originally was an implementation side-effect which was
broken by 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound
workqueues"). Because there were users that depended on the ordered
execution, 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered") made the workqueue allocation path implicitly promote UNBOUND
workqueues w/ @max_active==1 to ordered workqueues.
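
The implicit promotion boils down to roughly this check in the workqueue allocation path (paraphrased; comments mine):

```c
/* In alloc_workqueue(): an unbound workqueue requested with
 * max_active == 1 is silently upgraded to an ordered one to
 * preserve the execution ordering that older callers relied on. */
if ((flags & WQ_UNBOUND) && max_active == 1)
	flags |= __WQ_ORDERED;
```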

While this has worked okay, overloading the UNBOUND allocation interface
this way creates other issues. It's difficult to tell whether a given
workqueue actually needs to be ordered, and users that legitimately want a
min concurrency level wq unexpectedly get an ordered one instead. With
planned UNBOUND workqueue updates to improve execution locality and more
prevalence of chiplet designs which can benefit from such improvements, this
isn't a state we want to be in forever.

This patch series audits all callsites that create an UNBOUND workqueue w/
@max_active==1 and converts them to alloc_ordered_workqueue() as necessary.

WHAT TO LOOK FOR
================

The conversions are from

  alloc_workqueue(WQ_UNBOUND | flags, 1, args...)

to

  alloc_ordered_workqueue(flags, args...)

which doesn't cause any functional changes. If you know that fully ordered
execution is not necessary, please let me know. I'll drop the conversion and
instead add a comment noting the fact to reduce confusion while the
conversion is in progress.

If you aren't fully sure, it's completely fine to let the conversion
through. The behavior will stay exactly the same and we can always
reconsider later.

As there are follow-up workqueue core changes, I'd really appreciate it if
the patch could be routed through the workqueue tree w/ your acks. Thanks.

v3: btrfs_alloc_workqueue() excluded again. The concurrency level of a
    workqueue allocated through btrfs_alloc_workqueue() may be dynamically
    adjusted through thresh_exec_hook(), so they can't be ordered.

v2: btrfs_alloc_workqueue() updated too as suggested by Wang.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/scrub.c   | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

Comments

David Sterba May 9, 2023, 2:53 p.m. UTC | #1
On Mon, May 08, 2023 at 03:50:27PM -1000, Tejun Heo wrote:
> BACKGROUND
> ==========
> 
> When multiple work items are queued to a workqueue, their execution order
> doesn't match the queueing order. They may get executed in any order and
> simultaneously. When fully serialized execution - one by one in the queueing
> order - is needed, an ordered workqueue should be used which can be created
> with alloc_ordered_workqueue().
> 
> However, alloc_ordered_workqueue() was a later addition. Before it, an
> ordered workqueue could be obtained by creating an UNBOUND workqueue with
> @max_active==1. This originally was an implementation side-effect which was
> broken by 4c16bd327c74 ("workqueue: implement NUMA affinity for unbound
> workqueues"). Because there were users that depended on the ordered
> execution, 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
> ordered") made the workqueue allocation path implicitly promote UNBOUND
> workqueues w/ @max_active==1 to ordered workqueues.
> 
> While this has worked okay, overloading the UNBOUND allocation interface
> this way creates other issues. It's difficult to tell whether a given
> workqueue actually needs to be ordered, and users that legitimately want a
> min concurrency level wq unexpectedly get an ordered one instead. With
> planned UNBOUND workqueue updates to improve execution locality and more
> prevalence of chiplet designs which can benefit from such improvements, this
> isn't a state we want to be in forever.
> 
> This patch series audits all callsites that create an UNBOUND workqueue w/
> @max_active==1 and converts them to alloc_ordered_workqueue() as necessary.
> 
> WHAT TO LOOK FOR
> ================
> 
> The conversions are from
> 
>   alloc_workqueue(WQ_UNBOUND | flags, 1, args...)
> 
> to
> 
>   alloc_ordered_workqueue(flags, args...)
> 
> which doesn't cause any functional changes. If you know that fully ordered
> execution is not necessary, please let me know. I'll drop the conversion and
> instead add a comment noting the fact to reduce confusion while the
> conversion is in progress.
> 
> If you aren't fully sure, it's completely fine to let the conversion
> through. The behavior will stay exactly the same and we can always
> reconsider later.
> 
> As there are follow-up workqueue core changes, I'd really appreciate it if
> the patch could be routed through the workqueue tree w/ your acks. Thanks.
> 
> v3: btrfs_alloc_workqueue() excluded again. The concurrency level of a
>     workqueue allocated through btrfs_alloc_workqueue() may be dynamically
>     adjusted through thresh_exec_hook(), so they can't be ordered.
> 
> v2: btrfs_alloc_workqueue() updated too as suggested by Wang.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Wang Yugui <wangyugui@e16-tech.com>
> Cc: Chris Mason <clm@fb.com>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: David Sterba <dsterba@suse.com>
> Cc: linux-btrfs@vger.kernel.org
> ---
>  fs/btrfs/disk-io.c | 2 +-
>  fs/btrfs/scrub.c   | 6 ++++--
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 59ea049fe7ee..32d08aed88b6 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2217,7 +2217,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>  	fs_info->qgroup_rescan_workers =
>  		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
>  	fs_info->discard_ctl.discard_workers =
> -		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> +		alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
>  
>  	if (!(fs_info->workers && fs_info->hipri_workers &&
>  	      fs_info->delalloc_workers && fs_info->flush_workers &&

I think there are a few more conversions missing. There's a local flags
variable in btrfs_init_workqueues

2175 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
2176 {
2177         u32 max_active = fs_info->thread_pool_size;
2178         unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;

And used like

2194         fs_info->fixup_workers =
2195                 btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);

2213         fs_info->qgroup_rescan_workers =
2214                 btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);

WQ_UNBOUND is not mentioned explicitly like for the "btrfs_discard"
workqueue.  Patch v2 did the switch in btrfs_alloc_workqueue according
to the max_active/limit_active parameter, but that would affect all
queues, and not all of them need to be ordered.

In btrfs_resize_thread_pool, workqueue_set_max_active is called
directly or indirectly, so this can set max_active to a user-defined
mount option. Could this be a problem or trigger a warning? It would
lead to max_active==1 + WQ_UNBOUND.
Tejun Heo May 9, 2023, 3:57 p.m. UTC | #2
Hello, David.

Thanks for taking a look.

On Tue, May 09, 2023 at 04:53:32PM +0200, David Sterba wrote:
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 59ea049fe7ee..32d08aed88b6 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2217,7 +2217,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
> >  	fs_info->qgroup_rescan_workers =
> >  		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
> >  	fs_info->discard_ctl.discard_workers =
> > -		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> > +		alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
> >  
> >  	if (!(fs_info->workers && fs_info->hipri_workers &&
> >  	      fs_info->delalloc_workers && fs_info->flush_workers &&
> 
> I think there are a few more conversions missing. There's a local flags
> variable in btrfs_init_workqueues
> 
> 2175 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
> 2176 {
> 2177         u32 max_active = fs_info->thread_pool_size;
> 2178         unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
> 
> And used like
> 
> 2194         fs_info->fixup_workers =
> 2195                 btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
> 
> 2213         fs_info->qgroup_rescan_workers =
> 2214                 btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);

Right you are.

> WQ_UNBOUND is not mentioned explicitly like for the "btrfs_discard"
> workqueue.  Patch v2 did the switch in btrfs_alloc_workqueue according
> to the max_active/limit_active parameter, but that would affect all
> queues, and not all of them need to be ordered.

The thresh mechanism which auto-adjusts max_active means that the workqueues
allocated by btrfs_alloc_workqueue() can't be ordered, right? When thresh is
smaller than DFT_THRESHOLD the mechanism is disabled, but that looks like an
optimization.
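
For illustration, a thresh-style mechanism of the kind discussed here can be sketched roughly as follows. This is a hypothetical simplification, NOT the actual btrfs implementation; the struct and function names are invented for the example:

```c
/* Hypothetical sketch: the allowed concurrency of a wrapped normal
 * workqueue is grown or shrunk based on how many work items are
 * pending relative to a threshold. A workqueue whose concurrency is
 * adjusted like this cannot rely on ordered execution. */
struct adjustable_wq {
	struct workqueue_struct *normal_wq;
	atomic_t pending;	/* queued-but-unfinished items */
	int current_active;	/* currently allowed concurrency */
	int limit_active;	/* upper bound on concurrency */
	int thresh;		/* pending count that triggers growth */
};

static void thresh_adjust(struct adjustable_wq *wq)
{
	int pending = atomic_read(&wq->pending);

	if (pending > wq->thresh && wq->current_active < wq->limit_active)
		workqueue_set_max_active(wq->normal_wq, ++wq->current_active);
	else if (pending < wq->thresh / 2 && wq->current_active > 1)
		workqueue_set_max_active(wq->normal_wq, --wq->current_active);
}
```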

> In btrfs_resize_thread_pool, workqueue_set_max_active is called
> directly or indirectly, so this can set max_active to a user-defined
> mount option. Could this be a problem or trigger a warning? It would
> lead to max_active==1 + WQ_UNBOUND.

That's not a problem. The only thing we need to make sure is that the
workqueues which actually *must* be ordered use alloc_ordered_workqueue() as
they won't be implicitly treated as ordered in the future.

* The current patch converts two - fs_info->discard_ctl.discard_workers and
  scrub_workers when @is_dev_replace is set. Do they actually need to be
  ordered?

* As you pointed out, fs_info->fixup_workers and
  fs_info->qgroup_rescan_workers are also currently implicitly ordered. Do
  they actually need to be ordered?

Thanks.
David Sterba May 9, 2023, 11:36 p.m. UTC | #3
On Tue, May 09, 2023 at 05:57:16AM -1000, Tejun Heo wrote:
> Hello, David.
> 
> Thanks for taking a look.
> 
> On Tue, May 09, 2023 at 04:53:32PM +0200, David Sterba wrote:
> > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > index 59ea049fe7ee..32d08aed88b6 100644
> > > --- a/fs/btrfs/disk-io.c
> > > +++ b/fs/btrfs/disk-io.c
> > > @@ -2217,7 +2217,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
> > >  	fs_info->qgroup_rescan_workers =
> > >  		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
> > >  	fs_info->discard_ctl.discard_workers =
> > > -		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> > > +		alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
> > >  
> > >  	if (!(fs_info->workers && fs_info->hipri_workers &&
> > >  	      fs_info->delalloc_workers && fs_info->flush_workers &&
> > 
> > I think there are a few more conversions missing. There's a local flags
> > variable in btrfs_init_workqueues
> > 
> > 2175 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
> > 2176 {
> > 2177         u32 max_active = fs_info->thread_pool_size;
> > 2178         unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
> > 
> > And used like
> > 
> > 2194         fs_info->fixup_workers =
> > 2195                 btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
> > 
> > 2213         fs_info->qgroup_rescan_workers =
> > 2214                 btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
> 
> Right you are.
> 
> > WQ_UNBOUND is not mentioned explicitly like for the "btrfs_discard"
> > workqueue.  Patch v2 did the switch in btrfs_alloc_workqueue according
> > to the max_active/limit_active parameter, but that would affect all
> > queues, and not all of them need to be ordered.
> 
> The thresh mechanism which auto-adjusts max_active means that the workqueues
> allocated by btrfs_alloc_workqueue() can't be ordered, right? When thresh is
> smaller than DFT_THRESHOLD the mechanism is disabled, but that looks like an
> optimization.

Yeah, I think so, but I'm not entirely sure. Ordering should not be
required for queues that start with max_active > 1; there
parallelization and out-of-order processing are expected, and results
are serialized or decided once the work is done.

> > In btrfs_resize_thread_pool, workqueue_set_max_active is called
> > directly or indirectly, so this can set max_active to a user-defined
> > mount option. Could this be a problem or trigger a warning? It would
> > lead to max_active==1 + WQ_UNBOUND.
> 
> That's not a problem. The only thing we need to make sure is that the
> workqueues which actually *must* be ordered use alloc_ordered_workqueue() as
> they won't be implicitly treated as ordered in the future.
> 
> * The current patch converts two - fs_info->discard_ctl.discard_workers and
>   scrub_workers when @is_dev_replace is set. Do they actually need to be
>   ordered?
> 
> * As you pointed out, fs_info->fixup_workers and
>   fs_info->qgroup_rescan_workers are also currently implicitly ordered. Do
>   they actually need to be ordered?

I think all of them somehow implicitly depend on the ordering. The
replace process sequentially goes over a block group and copies blocks.

The fixup process is quite obscure and we should preserve the semantics
as much as possible. It has something to do with pages that get out of
sync with the extent state without btrfs knowing; the chance that more
such requests happen at the same time is low, but once it happens it can
lead to corruptions.

Quota rescan is by nature also a sequential process, but I think it
does not need to be ordered; it's started from a higher-level context
like enabling quotas or rescan, but there are also calls at remount
time, so this makes it less clear.

In summary, if the ordered queue can be used then I'd recommend doing
so as the safe option.
Tejun Heo May 9, 2023, 11:47 p.m. UTC | #4
Hello, David.

On Wed, May 10, 2023 at 01:36:20AM +0200, David Sterba wrote:
...
> Yeah, I think so, but I'm not entirely sure. Ordering should not be
> required for queues that start with max_active > 1; there
> parallelization and out-of-order processing are expected, and results
> are serialized or decided once the work is done.
> 
> > > In btrfs_resize_thread_pool, workqueue_set_max_active is called
> > > directly or indirectly, so this can set max_active to a user-defined
> > > mount option. Could this be a problem or trigger a warning? It would
> > > lead to max_active==1 + WQ_UNBOUND.
> > 
> > That's not a problem. The only thing we need to make sure is that the
> > workqueues which actually *must* be ordered use alloc_ordered_workqueue() as
> > they won't be implicitly treated as ordered in the future.
> > 
> > * The current patch converts two - fs_info->discard_ctl.discard_workers and
> >   scrub_workers when @is_dev_replace is set. Do they actually need to be
> >   ordered?
> > 
> > * As you pointed out, fs_info->fixup_workers and
> >   fs_info->qgroup_rescan_workers are also currently implicitly ordered. Do
> >   they actually need to be ordered?
> 
> I think all of them somehow implicitly depend on the ordering. The
> replace process sequentially goes over a block group and copies blocks.
> 
> The fixup process is quite obscure and we should preserve the semantics
> as much as possible. It has something to do with pages that get out of
> sync with the extent state without btrfs knowing; the chance that more
> such requests happen at the same time is low, but once it happens it can
> lead to corruptions.
> 
> Quota rescan is by nature also a sequential process, but I think it
> does not need to be ordered; it's started from a higher-level context
> like enabling quotas or rescan, but there are also calls at remount
> time, so this makes it less clear.
> 
> In summary, if the ordered queue can be used then I'd recommend doing
> so as the safe option.

I see. It seems rather error-prone to make workqueues implicitly ordered
from btrfs_alloc_workqueue(). I'll see if I can make it explicit and keep
all workqueues that are currently guaranteed to be ordered as ordered.

Thanks.

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59ea049fe7ee..32d08aed88b6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2217,7 +2217,7 @@  static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	fs_info->qgroup_rescan_workers =
 		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
 	fs_info->discard_ctl.discard_workers =
-		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
+		alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
 
 	if (!(fs_info->workers && fs_info->hipri_workers &&
 	      fs_info->delalloc_workers && fs_info->flush_workers &&
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 836725a19661..d5401d7813a5 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2734,8 +2734,10 @@  static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 	if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt))
 		return 0;
 
-	scrub_workers = alloc_workqueue("btrfs-scrub", flags,
-					is_dev_replace ? 1 : max_active);
+	if (is_dev_replace)
+		scrub_workers = alloc_ordered_workqueue("btrfs-scrub", flags);
+	else
+		scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active);
 	if (!scrub_workers)
 		goto fail_scrub_workers;