Message ID: 20230509015032.3768622-9-tj@kernel.org (mailing list archive)
State: New, archived
On Mon, May 08, 2023 at 03:50:27PM -1000, Tejun Heo wrote:
> BACKGROUND
> ==========
>
> When multiple work items are queued to a workqueue, their execution order
> doesn't match the queueing order. They may get executed in any order and
> simultaneously. When fully serialized execution - one by one in the
> queueing order - is needed, an ordered workqueue should be used, which
> can be created with alloc_ordered_workqueue().
>
> However, alloc_ordered_workqueue() was a later addition. Before it, an
> ordered workqueue could be obtained by creating an UNBOUND workqueue with
> @max_active==1. This originally was an implementation side-effect which
> was broken by 4c16bd327c74 ("workqueue: implement NUMA affinity for
> unbound workqueues"). Because there were users that depended on the
> ordered execution, 5c0338c68706 ("workqueue: restore
> WQ_UNBOUND/max_active==1 to be ordered") made the workqueue allocation
> path implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered
> workqueues.
>
> While this has worked okay, overloading the UNBOUND allocation interface
> this way creates other issues. It's difficult to tell whether a given
> workqueue actually needs to be ordered, and users that legitimately want
> a min concurrency level wq unexpectedly get an ordered one instead. With
> planned UNBOUND workqueue updates to improve execution locality and more
> prevalence of chiplet designs which can benefit from such improvements,
> this isn't a state we wanna be in forever.
>
> This patch series audits all callsites that create an UNBOUND workqueue
> w/ @max_active==1 and converts them to alloc_ordered_workqueue() as
> necessary.
>
> WHAT TO LOOK FOR
> ================
>
> The conversions are from
>
>     alloc_workqueue(WQ_UNBOUND | flags, 1, args...)
>
> to
>
>     alloc_ordered_workqueue(flags, args...)
>
> which don't cause any functional changes. If you know that fully ordered
> execution is not necessary, please let me know.
> I'll drop the conversion and instead add a comment noting the fact to
> reduce confusion while the conversion is in progress.
>
> If you aren't fully sure, it's completely fine to let the conversion
> through. The behavior will stay exactly the same and we can always
> reconsider later.
>
> As there are follow-up workqueue core changes, I'd really appreciate it
> if the patch can be routed through the workqueue tree w/ your acks.
> Thanks.
>
> v3: btrfs_alloc_workqueue() excluded again. The concurrency level of a
>     workqueue allocated through btrfs_alloc_workqueue() may be
>     dynamically adjusted through thresh_exec_hook(), so they can't be
>     ordered.
>
> v2: btrfs_alloc_workqueue() updated too as suggested by Wang.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Wang Yugui <wangyugui@e16-tech.com>
> Cc: Chris Mason <clm@fb.com>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: David Sterba <dsterba@suse.com>
> Cc: linux-btrfs@vger.kernel.org
> ---
>  fs/btrfs/disk-io.c | 2 +-
>  fs/btrfs/scrub.c   | 6 ++++--
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 59ea049fe7ee..32d08aed88b6 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2217,7 +2217,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>  	fs_info->qgroup_rescan_workers =
>  		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
>  	fs_info->discard_ctl.discard_workers =
> -		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> +		alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
>
>  	if (!(fs_info->workers && fs_info->hipri_workers &&
>  	      fs_info->delalloc_workers && fs_info->flush_workers &&

I think there are a few more conversions missing.
There's a local flags variable in btrfs_init_workqueues:

  2175 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
  2176 {
  2177         u32 max_active = fs_info->thread_pool_size;
  2178         unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;

And it's used like:

  2194         fs_info->fixup_workers =
  2195                 btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);

  2213         fs_info->qgroup_rescan_workers =
  2214                 btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);

WQ_UNBOUND is not mentioned explicitly like for the "btrfs_discard"
workqueue. Patch v2 did the switch in btrfs_alloc_workqueue according to
the max_active/limit_active parameter, but this would affect all queues,
and not all of them need to be ordered.

In btrfs_resize_thread_pool, workqueue_set_max_active is called directly
or indirectly, so this can set max_active to a user-defined mount option.
Could this be a problem or trigger a warning? This would lead to
max_active==1 + WQ_UNBOUND.
Hello, David.

Thanks for taking a look.

On Tue, May 09, 2023 at 04:53:32PM +0200, David Sterba wrote:
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 59ea049fe7ee..32d08aed88b6 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -2217,7 +2217,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
> >  	fs_info->qgroup_rescan_workers =
> >  		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
> >  	fs_info->discard_ctl.discard_workers =
> > -		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
> > +		alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
> >
> >  	if (!(fs_info->workers && fs_info->hipri_workers &&
> >  	      fs_info->delalloc_workers && fs_info->flush_workers &&
>
> I think there are a few more conversions missing. There's a local flags
> variable in btrfs_init_workqueues:
>
>   2175 static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
>   2176 {
>   2177         u32 max_active = fs_info->thread_pool_size;
>   2178         unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
>
> And it's used like:
>
>   2194         fs_info->fixup_workers =
>   2195                 btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
>
>   2213         fs_info->qgroup_rescan_workers =
>   2214                 btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);

Right you are.

> WQ_UNBOUND is not mentioned explicitly like for the "btrfs_discard"
> workqueue. Patch v2 did the switch in btrfs_alloc_workqueue according
> to the max_active/limit_active parameter but this would affect all
> queues and not all of them need to be ordered.

The thresh mechanism which auto-adjusts max_active means that the
workqueues allocated through btrfs_alloc_workqueue() can't be ordered,
right? When thresh is smaller than DFT_THRESHOLD, the mechanism is
disabled, but that looks like an optimization.

> In btrfs_resize_thread_pool the workqueue_set_max_active is called
> directly or indirectly so this can set the max_active to a user-defined
> mount option.
> Could this be a problem or trigger a warning? This would lead to
> max_active==1 + WQ_UNBOUND.

That's not a problem. The only thing we need to make sure is that the
workqueues which actually *must* be ordered use alloc_ordered_workqueue()
as they won't be implicitly treated as ordered in the future.

* The current patch converts two - fs_info->discard_ctl.discard_workers
  and scrub_workers when @is_dev_replace is set. Do they actually need to
  be ordered?

* As you pointed out, fs_info->fixup_workers and
  fs_info->qgroup_rescan_workers are also currently implicitly ordered.
  Do they actually need to be ordered?

Thanks.
On Tue, May 09, 2023 at 05:57:16AM -1000, Tejun Heo wrote:
> Hello, David.
>
> Thanks for taking a look.
>
> On Tue, May 09, 2023 at 04:53:32PM +0200, David Sterba wrote:
...
> > WQ_UNBOUND is not mentioned explicitly like for the "btrfs_discard"
> > workqueue. Patch v2 did the switch in btrfs_alloc_workqueue according
> > to the max_active/limit_active parameter but this would affect all
> > queues and not all of them need to be ordered.
>
> The thresh mechanism which auto-adjusts max_active means that the
> workqueues allocated through btrfs_alloc_workqueue() can't be ordered,
> right? When thresh is smaller than DFT_THRESHOLD, the mechanism is
> disabled, but that looks like an optimization.
Yeah, I think so, but I'm not entirely sure. Ordering should not be
required for the queues that start with max_active > 1; there,
parallelization and out-of-order processing are expected, and any
serialization is decided once the work is done.

> > In btrfs_resize_thread_pool the workqueue_set_max_active is called
> > directly or indirectly so this can set the max_active to a user-defined
> > mount option. Could this be a problem or trigger a warning? This would
> > lead to max_active==1 + WQ_UNBOUND.
>
> That's not a problem. The only thing we need to make sure is that the
> workqueues which actually *must* be ordered use alloc_ordered_workqueue()
> as they won't be implicitly treated as ordered in the future.
>
> * The current patch converts two - fs_info->discard_ctl.discard_workers
>   and scrub_workers when @is_dev_replace is set. Do they actually need
>   to be ordered?
>
> * As you pointed out, fs_info->fixup_workers and
>   fs_info->qgroup_rescan_workers are also currently implicitly ordered.
>   Do they actually need to be ordered?

I think all of them somehow implicitly depend on the ordering. The
replace process sequentially goes over a block group and copies blocks.

The fixup process is quite obscure, and we should preserve the semantics
as much as possible. It has something to do with pages that get out of
sync with extent state without btrfs knowing. The chance that more such
requests happen at the same time is low, but once it happens it can lead
to corruptions.

Quota rescan is by nature also a sequential process, but I think it does
not need to be ordered. It's started from a higher-level context like
enabling quotas or rescan, but there are also calls at remount time, so
this makes it less clear.

In summary, if the ordered queue can be used then I'd recommend doing so
as the safe option.
Hello, David.

On Wed, May 10, 2023 at 01:36:20AM +0200, David Sterba wrote:
...
> I think all of them somehow implicitly depend on the ordering. The
> replace process sequentially goes over a block group and copies blocks.
>
> The fixup process is quite obscure, and we should preserve the semantics
> as much as possible. It has something to do with pages that get out of
> sync with extent state without btrfs knowing. The chance that more such
> requests happen at the same time is low, but once it happens it can lead
> to corruptions.
>
> Quota rescan is by nature also a sequential process, but I think it does
> not need to be ordered. It's started from a higher-level context like
> enabling quotas or rescan, but there are also calls at remount time, so
> this makes it less clear.
>
> In summary, if the ordered queue can be used then I'd recommend doing so
> as the safe option.

I see.
It seems rather error-prone to have btrfs_alloc_workqueue() make
workqueues implicitly ordered. I'll see if I can make it explicit and
keep ordered all the workqueues which are currently guaranteed to be
ordered. Thanks.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59ea049fe7ee..32d08aed88b6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2217,7 +2217,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	fs_info->qgroup_rescan_workers =
 		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
 	fs_info->discard_ctl.discard_workers =
-		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
+		alloc_ordered_workqueue("btrfs_discard", WQ_FREEZABLE);
 
 	if (!(fs_info->workers && fs_info->hipri_workers &&
 	      fs_info->delalloc_workers && fs_info->flush_workers &&
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 836725a19661..d5401d7813a5 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2734,8 +2734,10 @@ static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 	if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt))
 		return 0;
 
-	scrub_workers = alloc_workqueue("btrfs-scrub", flags,
-					is_dev_replace ? 1 : max_active);
+	if (is_dev_replace)
+		scrub_workers = alloc_ordered_workqueue("btrfs-scrub", flags);
+	else
+		scrub_workers = alloc_workqueue("btrfs-scrub", flags, max_active);
 	if (!scrub_workers)
 		goto fail_scrub_workers;
BACKGROUND
==========

When multiple work items are queued to a workqueue, their execution order
doesn't match the queueing order. They may get executed in any order and
simultaneously. When fully serialized execution - one by one in the
queueing order - is needed, an ordered workqueue should be used, which
can be created with alloc_ordered_workqueue().

However, alloc_ordered_workqueue() was a later addition. Before it, an
ordered workqueue could be obtained by creating an UNBOUND workqueue with
@max_active==1. This originally was an implementation side-effect which
was broken by 4c16bd327c74 ("workqueue: implement NUMA affinity for
unbound workqueues"). Because there were users that depended on the
ordered execution, 5c0338c68706 ("workqueue: restore
WQ_UNBOUND/max_active==1 to be ordered") made the workqueue allocation
path implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered
workqueues.

While this has worked okay, overloading the UNBOUND allocation interface
this way creates other issues. It's difficult to tell whether a given
workqueue actually needs to be ordered, and users that legitimately want
a min concurrency level wq unexpectedly get an ordered one instead. With
planned UNBOUND workqueue updates to improve execution locality and more
prevalence of chiplet designs which can benefit from such improvements,
this isn't a state we wanna be in forever.

This patch series audits all callsites that create an UNBOUND workqueue
w/ @max_active==1 and converts them to alloc_ordered_workqueue() as
necessary.

WHAT TO LOOK FOR
================

The conversions are from

    alloc_workqueue(WQ_UNBOUND | flags, 1, args...)

to

    alloc_ordered_workqueue(flags, args...)

which don't cause any functional changes. If you know that fully ordered
execution is not necessary, please let me know. I'll drop the conversion
and instead add a comment noting the fact to reduce confusion while the
conversion is in progress.
If you aren't fully sure, it's completely fine to let the conversion
through. The behavior will stay exactly the same and we can always
reconsider later.

As there are follow-up workqueue core changes, I'd really appreciate it
if the patch can be routed through the workqueue tree w/ your acks.
Thanks.

v3: btrfs_alloc_workqueue() excluded again. The concurrency level of a
    workqueue allocated through btrfs_alloc_workqueue() may be
    dynamically adjusted through thresh_exec_hook(), so they can't be
    ordered.

v2: btrfs_alloc_workqueue() updated too as suggested by Wang.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/scrub.c   | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)