Message ID | 20230320210724.GB1434@sol.localdomain (mailing list archive) |
---|---|
State | Mainlined, archived |
Headers | show |
Series | [GIT,PULL] fsverity fixes for v6.3-rc4 | expand |
On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@kernel.org> wrote: > > Nathan Huckleberry (1): > fsverity: Remove WQ_UNBOUND from fsverity read workqueue There's a *lot* of other WQ_UNBOUND users. If it performs that badly, maybe there is something wrong with the workqueue code. Should people be warned to not use WQ_UNBOUND - or is there something very special about fsverity? Added Tejun to the cc. With one of the main documented reasons for WQ_UNBOUND being performance (both implicit "try to start execution of work items as soon as possible") and explicit ("CPU intensive workloads which can be better managed by the system scheduler"), maybe it's time to reconsider? WQ_UNBOUND adds a fair amount of complexity and special cases to the workqueues, and this is now the second "let's remove it because it's hurting things in a big way". Linus
The pull request you sent on Mon, 20 Mar 2023 14:07:24 -0700:
> https://git.kernel.org/pub/scm/fs/fsverity/linux.git tags/fsverity-for-linus
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/17214b70a159c6547df9ae204a6275d983146f6b
Thank you!
On Mon, Mar 20, 2023 at 03:31:13PM -0700, Linus Torvalds wrote: > On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@kernel.org> wrote: > > > > Nathan Huckleberry (1): > > fsverity: Remove WQ_UNBOUND from fsverity read workqueue > > There's a *lot* of other WQ_UNBOUND users. If it performs that badly, > maybe there is something wrong with the workqueue code. > > Should people be warned to not use WQ_UNBOUND - or is there something > very special about fsverity? > > Added Tejun to the cc. With one of the main documented reasons for > WQ_UNBOUND being performance (both implicit "try to start execution of > work items as soon as possible") and explicit ("CPU intensive > workloads which can be better managed by the system scheduler"), maybe > it's time to reconsider? > > WQ_UNBOUND adds a fair amount of complexity and special cases to the > workqueues, and this is now the second "let's remove it because it's > hurting things in a big way". > > Linus So, Nathan has been doing most of the investigation and testing on this, and he's out of office at the moment. But, my understanding is that since modern CPUs have acceleration for all the common crypto algorithms (including fsverity's SHA-256), the work items just don't take long enough for the overhead of a context switch to be worth it. WQ_UNBOUND seems to be optimized for much longer running work items. Additionally, the WQ_UNBOUND overhead is particularly bad on arm64. We aren't sure of the reason for this. Nathan thinks this is probably related to overhead of saving/restoring the FPU+SIMD state. My theory is that it's mainly caused by heterogeneous processors, where work that would ordinarily run on the fastest CPU core gets scheduled on a slow CPU core. Maybe it's a combination of both. WQ_UNBOUND has been shown to be detrimental to EROFS decompression and to dm-verity too, so this isn't specific to fsverity. (fscrypt is still under investigation. I'd guess the same applies, but it's been less of a priority since fscrypt doesn't use a workqueue when inline encryption is being used.) These are all "I/O post-processing cases", though, so all sort of similar. - Eric
Hello, (cc'ing Lai.) On Mon, Mar 20, 2023 at 03:31:13PM -0700, Linus Torvalds wrote: > On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@kernel.org> wrote: > > > > Nathan Huckleberry (1): > > fsverity: Remove WQ_UNBOUND from fsverity read workqueue > > There's a *lot* of other WQ_UNBOUND users. If it performs that badly, > maybe there is something wrong with the workqueue code. > > Should people be warned to not use WQ_UNBOUND - or is there something > very special about fsverity? > > Added Tejun to the cc. With one of the main documented reasons for > WQ_UNBOUND being performance (both implicit "try to start execution of > work items as soon as possible") and explicit ("CPU intensive > workloads which can be better managed by the system scheduler"), maybe > it's time to reconsider? > > WQ_UNBOUND adds a fair amount of complexity and special cases to the > workqueues, and this is now the second "let's remove it because it's > hurting things in a big way". Do you remember what the other case was? Was it also on heterogenous arm setup? There aren't many differences between unbound workqueues and percpu ones that aren't concurrency managed. If there are significant performance differences, it's unlikely to be directly from whatever workqueue is doing. One obvious thing that comes to mind is that WQ_UNBOUND may be pushing tasks across expensive cache boundaries (e.g. across cores that are living on separate L3 complexes). This isn't a totally new problem and workqueue has some topology awareness, by default, WQ_UNBOUND pools are segregated across NUMA boundaries. This used to be fine but I think it's likely outmoded now. given that non-trivial cache hierarchies on top of UMA or inside a node are a thing these days. Looking at f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read workqueue"), I feel a bit uneasy. This would be fine on a setup which does moderate amount of IOs on CPUs with quick enough accelration mechanisms, but that's not the whole world. Use cases that generate extreme amount of IOs do depend on the ability to fan out IO related work items across multiple CPUs especially if the IOs coincide with network activities. So, my intuition is that the commit is fixing a subset of use cases while likely regressing others. If the cache theory is correct, the right thing to do would be making workqueue init code a bit smarter so that it segements unbound pools on LLC boundaries rather than NUMA, which would make more sense on recent AMD chips too. Nathan, can you run `hwloc-ls` on the affected setup (or `lstopo out.pdf`) and attach the output? As for the overhead of supporting WQ_UNBOUND, it does add non-trivial amount of complexity but of the boring kind. It's all managerial stuff which isn't too difficult to understand and relatively easy to understand and fix when something goes wrong, so it isn't expensive in terms of supportability and it does address classes of significant use cases, so I think we should just fix it. Thanks.
On Mon, Mar 20, 2023 at 11:05 PM Tejun Heo <tj@kernel.org> wrote: > > Do you remember what the other case was? Was it also on heterogenous arm > setup? Yup. See commit c25da5b7baf1 ("dm verity: stop using WQ_UNBOUND for verify_wq") But see also 3fffb589b9a6 ("erofs: add per-cpu threads for decompression as an option"). And you can see the confusion this all has in commit 43fa47cb116d ("dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND"), which perhaps should be undone now. > There aren't many differences between unbound workqueues and percpu ones > that aren't concurrency managed. If there are significant performance > differences, it's unlikely to be directly from whatever workqueue is doing. There's a *lot* of special cases for WQ_UNBOUND in the workqueue code, and they are a lot less targeted than the other WQ_xyz flags, I feel. They have their own cpumask logic, special freeing rules etc etc. So I would say that the "aren't many differences" is not exactly true. There are subtle and random differences, including the very basic "queue_work()" workflow. Now, I assume that the arm cases don't actually use wq_unbound_cpumask, so I assume it's mostly the "instead of local cpu queue, use the local node queue", and so it's all on random CPU's since nobody uses NUMA nodes. And no, if it's caching effects, doing it on LLC boundaries isn't rigth *either*. By default it should probably be on L2 boundaries or something, with most non-NUMA setups likely having one single LLC but multiple L2 nodes. Linus
Hello, Linus. On Tue, Mar 21, 2023 at 11:31:35AM -0700, Linus Torvalds wrote: > On Mon, Mar 20, 2023 at 11:05 PM Tejun Heo <tj@kernel.org> wrote: > > > > Do you remember what the other case was? Was it also on heterogenous arm > > setup? > > Yup. See commit c25da5b7baf1 ("dm verity: stop using WQ_UNBOUND for verify_wq") > > But see also 3fffb589b9a6 ("erofs: add per-cpu threads for > decompression as an option"). > > And you can see the confusion this all has in commit 43fa47cb116d ("dm > verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND"), which > perhaps should be undone now. Thanks for the pointers. They all seem plausible symptoms of work items getting bounced across slow cache boundaries. I'm off for a few weeks so can't really dig in right now but will get to it afterwards. > > There aren't many differences between unbound workqueues and percpu ones > > that aren't concurrency managed. If there are significant performance > > differences, it's unlikely to be directly from whatever workqueue is doing. > > There's a *lot* of special cases for WQ_UNBOUND in the workqueue code, > and they are a lot less targeted than the other WQ_xyz flags, I feel. > They have their own cpumask logic, special freeing rules etc etc. > > So I would say that the "aren't many differences" is not exactly true. > There are subtle and random differences, including the very basic > "queue_work()" workflow. Oh yeah, pwq management side is pretty involved, being dynamic and all. I just couldn't think of anything in the issue & execution path which would explain the reported significant performance penalty. The issue path differences come down to node selection and dynamic pwq release handling, neither of which should be in play in this case. > Now, I assume that the arm cases don't actually use > wq_unbound_cpumask, so I assume it's mostly the "instead of local cpu > queue, use the local node queue", and so it's all on random CPU's > since nobody uses NUMA nodes. > > And no, if it's caching effects, doing it on LLC boundaries isn't > rigth *either*. By default it should probably be on L2 boundaries or > something, with most non-NUMA setups likely having one single LLC but > multiple L2 nodes. Hmm... on recent x86 cpus, that'd just end up paring up the hyperthreads, which would likely be too narrow especially given that l3's on recent cpus seem pretty fast. I think what I need to do is generalizing the numa logic so that it can sit on any of these topological boundaries and let arch define the default boundary and let each wq to override the selection. Another related shortcoming is that while the unbound wq's say "let the scheduler figure out the best solution within the boundary", they don't communicate the locality of work item to the scheduler at all, so within each boundary, from scheduler pov, the assignment is completely random. Down the line, it'd probably help if wq can provide some hints re. the expected locality. Thanks.
On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@kernel.org> wrote: > > Thanks for the pointers. They all seem plausible symptoms of work items > getting bounced across slow cache boundaries. I'm off for a few weeks so > can't really dig in right now but will get to it afterwards. So just as a gut feeling, I suspect that one solution would be to always *start* the work on the local CPU (where "local" might be the same, or at least a sibling). The only reason to migrate to another CPU would be if the work is CPU-intensive, and I do suspect that is commonly not really the case. And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage, and should just be gotten rid of, because what could be considered "CPU intensive" in under one situation might not be CPU intensive in another one, so trying to use some static knowledge about it is just pure guess-work. The different situations might be purely contextual things ("heavy network traffic when NAPI polling kicks in"), but it might also be purely hardware-related (ie "this is heavy if we don't have CPU hw acceleration for crypto, but cheap if we do"). So I really don't think it should be some static decision, either through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on first available CPU". Wouldn't it be much nicer if we just noticed it dynamically, and WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another CPU if it ends up being advantageous? And we actually kind of have that dynamic flag already, in the form of the scheduler. It might even be explicit in the context of the workqueue (with "need_resched()" being true and the workqueue code itself might notice it and explicitly then try to spread it out), but with preemption it's more implicit and maybe it needs a bit of tweaking help. So that's what I mean by "start the work as local CPU work" - use that as the baseline decision (since it's going to be the case that has cache locality), and actively try to avoid spreading things out unless we have an explicit reason to, and that reason we could just get from the scheduler. The worker code already has that "wq_worker_sleeping()" callback from the scheduler, but that only triggers when a worker is going to sleep. I'm saying that the "scheduler decided to schedule out a worker" case might be used as a "Oh, this is CPU intensive, let's try to spread it out". See what I'm trying to say? And yes, the WQ_UNBOUND case does have a weak "prefer local CPU" in how it basically tends to try to pick the current CPU unless there is some active reason not to (ie the whole "wq_select_unbound_cpu()" code), but I suspect that is then counter-acted by the fact that it will always pick the workqueue pool by node - so the "current CPU" ends up probably being affected by what random CPU that pool was running on. An alternative to any scheduler interaction thing might be to just tweak "first_idle_worker()". I get the feeling that that choice is just horrid, and that is another area that could really try to take locality into account. insert_work() realyl seems to pick a random worker from the pool - which is fine when the pool is per-cpu, but when it's the unbound "random node" pool, I really suspect that it might be much better to try to pick a worker that is on the right cpu. But hey, I may be missing something. You know this code much better than I do. Linus
Hey all, On Thu, Mar 23, 2023 at 11:04 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@kernel.org> wrote: > > > > Thanks for the pointers. They all seem plausible symptoms of work items > > getting bounced across slow cache boundaries. I'm off for a few weeks so > > can't really dig in right now but will get to it afterwards. > > So just as a gut feeling, I suspect that one solution would be to > always *start* the work on the local CPU (where "local" might be the > same, or at least a sibling). > > The only reason to migrate to another CPU would be if the work is > CPU-intensive, and I do suspect that is commonly not really the case. > > And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage, > and should just be gotten rid of, because what could be considered > "CPU intensive" in under one situation might not be CPU intensive in > another one, so trying to use some static knowledge about it is just > pure guess-work. > > The different situations might be purely contextual things ("heavy > network traffic when NAPI polling kicks in"), but it might also be > purely hardware-related (ie "this is heavy if we don't have CPU hw > acceleration for crypto, but cheap if we do"). > > So I really don't think it should be some static decision, either > through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on > first available CPU". I agree that these flags are prone to misuse. In most cases, there's no explanation for why the flags are being used. Either the flags were enabled unintentionally or the author never posted a performance justification. Imo figuring out which set of flags to set on which architecture is too much of a burden for each workqueue user. > > Wouldn't it be much nicer if we just noticed it dynamically, and > WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another > CPU if it ends up being advantageous? > > And we actually kind of have that dynamic flag already, in the form of > the scheduler. It might even be explicit in the context of the > workqueue (with "need_resched()" being true and the workqueue code > itself might notice it and explicitly then try to spread it out), but > with preemption it's more implicit and maybe it needs a bit of > tweaking help. > > So that's what I mean by "start the work as local CPU work" - use that > as the baseline decision (since it's going to be the case that has > cache locality), and actively try to avoid spreading things out unless > we have an explicit reason to, and that reason we could just get from > the scheduler. This would work for the use cases I'm worried about. Most of the work items used for IO post-processing are really fast. I suspect that the interaction between frequency scaling and WQ_UNBOUND is causing the slowdowns. > > The worker code already has that "wq_worker_sleeping()" callback from > the scheduler, but that only triggers when a worker is going to sleep. > I'm saying that the "scheduler decided to schedule out a worker" case > might be used as a "Oh, this is CPU intensive, let's try to spread it > out". > > See what I'm trying to say? > > And yes, the WQ_UNBOUND case does have a weak "prefer local CPU" in > how it basically tends to try to pick the current CPU unless there is > some active reason not to (ie the whole "wq_select_unbound_cpu()" > code), but I suspect that is then counter-acted by the fact that it > will always pick the workqueue pool by node - so the "current CPU" > ends up probably being affected by what random CPU that pool was > running on. > > An alternative to any scheduler interaction thing might be to just > tweak "first_idle_worker()". I get the feeling that that choice is > just horrid, and that is another area that could really try to take > locality into account. insert_work() realyl seems to pick a random > worker from the pool - which is fine when the pool is per-cpu, but > when it's the unbound "random node" pool, I really suspect that it > might be much better to try to pick a worker that is on the right cpu. > > But hey, I may be missing something. You know this code much better than I do. > > Linus Thanks, Huck
Hello, Linus. Okay, I'm now back online. On Thu, Mar 23, 2023 at 11:04:25AM -0700, Linus Torvalds wrote: > On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@kernel.org> wrote: > > > > Thanks for the pointers. They all seem plausible symptoms of work items > > getting bounced across slow cache boundaries. I'm off for a few weeks so > > can't really dig in right now but will get to it afterwards. > > So just as a gut feeling, I suspect that one solution would be to > always *start* the work on the local CPU (where "local" might be the > same, or at least a sibling). Yeah, that seems like the sanest way to leverage the scheduler. The only complication is around tracking which workers were on which CPUs and how sticky the cpu association should be (e.g. we don't want to unnecessarily jump workers across CPUs but we probably don't want to maintain strict per-cpu worker pools either). I'll try to come up with a reasonable trade-off which isn't too complicated. > The only reason to migrate to another CPU would be if the work is > CPU-intensive, and I do suspect that is commonly not really the case. > > And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage, > and should just be gotten rid of, because what could be considered > "CPU intensive" in under one situation might not be CPU intensive in > another one, so trying to use some static knowledge about it is just > pure guess-work. > > The different situations might be purely contextual things ("heavy > network traffic when NAPI polling kicks in"), but it might also be > purely hardware-related (ie "this is heavy if we don't have CPU hw > acceleration for crypto, but cheap if we do"). > > So I really don't think it should be some static decision, either > through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on > first available CPU". > > Wouldn't it be much nicer if we just noticed it dynamically, and > WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another > CPU if it ends up being advantageous? > > And we actually kind of have that dynamic flag already, in the form of > the scheduler. It might even be explicit in the context of the > workqueue (with "need_resched()" being true and the workqueue code > itself might notice it and explicitly then try to spread it out), but > with preemption it's more implicit and maybe it needs a bit of > tweaking help. Yeah, CPU_INTENSIVE was added as an easy (to implement) way out for cpu hogging percpu work items. Given that percpu workers track the scheduling events anyway whether from preemption or explicit schedule(), it should be possible to remove it while maintaining most of the benefits of worker concurrency management. Because the scheduler isn't aware of work item boundaries, workqueue can't blindly use scheduling events but that's easy to resolve with an extra timestamp. I'll think more about whether it'd be a good idea to subject unbound workers to concurrency management before it gets spread out so that the only distinction between percpu and unbound is whether the work item can be booted off cpu when they run for too long while being subject to the same concurrency control before that point. > So that's what I mean by "start the work as local CPU work" - use that > as the baseline decision (since it's going to be the case that has > cache locality), and actively try to avoid spreading things out unless > we have an explicit reason to, and that reason we could just get from > the scheduler. > > The worker code already has that "wq_worker_sleeping()" callback from > the scheduler, but that only triggers when a worker is going to sleep. > I'm saying that the "scheduler decided to schedule out a worker" case > might be used as a "Oh, this is CPU intensive, let's try to spread it > out". > > See what I'm trying to say? Yeah, lemme look into it. It'd be great to simplify workqueue usage and actually make it leverage what the scheduler knows about what should run where. Thanks.