mbox series

[GIT,PULL] fsverity fixes for v6.3-rc4

Message ID 20230320210724.GB1434@sol.localdomain (mailing list archive)
State Mainlined, archived
Headers show
Series [GIT,PULL] fsverity fixes for v6.3-rc4 | expand

Pull-request

https://git.kernel.org/pub/scm/fs/fsverity/linux.git tags/fsverity-for-linus

Message

Eric Biggers March 20, 2023, 9:07 p.m. UTC
The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

  Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/fs/fsverity/linux.git tags/fsverity-for-linus

for you to fetch changes up to a075bacde257f755bea0e53400c9f1cdd1b8e8e6:

  fsverity: don't drop pagecache at end of FS_IOC_ENABLE_VERITY (2023-03-15 22:50:41 -0700)

----------------------------------------------------------------

Fix two significant performance issues with fsverity.

----------------------------------------------------------------
Eric Biggers (1):
      fsverity: don't drop pagecache at end of FS_IOC_ENABLE_VERITY

Nathan Huckleberry (1):
      fsverity: Remove WQ_UNBOUND from fsverity read workqueue

 fs/verity/enable.c | 25 +++++++++++++------------
 fs/verity/verify.c | 12 ++++++------
 2 files changed, 19 insertions(+), 18 deletions(-)

Comments

Linus Torvalds March 20, 2023, 10:31 p.m. UTC | #1
On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> Nathan Huckleberry (1):
>       fsverity: Remove WQ_UNBOUND from fsverity read workqueue

There's a *lot* of other WQ_UNBOUND users. If it performs that badly,
maybe there is something wrong with the workqueue code.

Should people be warned to not use WQ_UNBOUND - or is there something
very special about fsverity?

Added Tejun to the cc. With one of the main documented reasons for
WQ_UNBOUND being performance (both implicit "try to start execution of
work items as soon as possible") and explicit ("CPU intensive
workloads which can be better managed by the system scheduler"), maybe
it's time to reconsider?

WQ_UNBOUND adds a fair amount of complexity and special cases to the
workqueues, and this is now the second "let's remove it because it's
hurting things in a big way".

              Linus
pr-tracker-bot@kernel.org March 20, 2023, 10:41 p.m. UTC | #2
The pull request you sent on Mon, 20 Mar 2023 14:07:24 -0700:

> https://git.kernel.org/pub/scm/fs/fsverity/linux.git tags/fsverity-for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/17214b70a159c6547df9ae204a6275d983146f6b

Thank you!
Eric Biggers March 20, 2023, 11:16 p.m. UTC | #3
On Mon, Mar 20, 2023 at 03:31:13PM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > Nathan Huckleberry (1):
> >       fsverity: Remove WQ_UNBOUND from fsverity read workqueue
> 
> There's a *lot* of other WQ_UNBOUND users. If it performs that badly,
> maybe there is something wrong with the workqueue code.
> 
> Should people be warned to not use WQ_UNBOUND - or is there something
> very special about fsverity?
> 
> Added Tejun to the cc. With one of the main documented reasons for
> WQ_UNBOUND being performance (both implicit "try to start execution of
> work items as soon as possible") and explicit ("CPU intensive
> workloads which can be better managed by the system scheduler"), maybe
> it's time to reconsider?
> 
> WQ_UNBOUND adds a fair amount of complexity and special cases to the
> workqueues, and this is now the second "let's remove it because it's
> hurting things in a big way".
> 
>               Linus

So, Nathan has been doing most of the investigation and testing on this, and
he's out of office at the moment.

But, my understanding is that since modern CPUs have acceleration for all the
common crypto algorithms (including fsverity's SHA-256), the work items just
don't take long enough for the overhead of a context switch to be worth it.
WQ_UNBOUND seems to be optimized for much longer running work items.

Additionally, the WQ_UNBOUND overhead is particularly bad on arm64.  We aren't
sure of the reason for this.  Nathan thinks this is probably related to overhead
of saving/restoring the FPU+SIMD state.  My theory is that it's mainly caused by
heterogeneous processors, where work that would ordinarily run on the fastest
CPU core gets scheduled on a slow CPU core.  Maybe it's a combination of both.

WQ_UNBOUND has been shown to be detrimental to EROFS decompression and to
dm-verity too, so this isn't specific to fsverity.  (fscrypt is still under
investigation.  I'd guess the same applies, but it's been less of a priority
since fscrypt doesn't use a workqueue when inline encryption is being used.)

These are all "I/O post-processing cases", though, so all sort of similar.

- Eric
Tejun Heo March 21, 2023, 6:05 a.m. UTC | #4
Hello,

(cc'ing Lai.)

On Mon, Mar 20, 2023 at 03:31:13PM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > Nathan Huckleberry (1):
> >       fsverity: Remove WQ_UNBOUND from fsverity read workqueue
> 
> There's a *lot* of other WQ_UNBOUND users. If it performs that badly,
> maybe there is something wrong with the workqueue code.
>
> Should people be warned to not use WQ_UNBOUND - or is there something
> very special about fsverity?
> 
> Added Tejun to the cc. With one of the main documented reasons for
> WQ_UNBOUND being performance (both implicit "try to start execution of
> work items as soon as possible") and explicit ("CPU intensive
> workloads which can be better managed by the system scheduler"), maybe
> it's time to reconsider?
> 
> WQ_UNBOUND adds a fair amount of complexity and special cases to the
> workqueues, and this is now the second "let's remove it because it's
> hurting things in a big way".

Do you remember what the other case was? Was it also on heterogenous arm
setup?

There aren't many differences between unbound workqueues and percpu ones
that aren't concurrency managed. If there are significant performance
differences, it's unlikely to be directly from whatever workqueue is doing.

One obvious thing that comes to mind is that WQ_UNBOUND may be pushing tasks
across expensive cache boundaries (e.g. across cores that are living on
separate L3 complexes). This isn't a totally new problem and workqueue has
some topology awareness, by default, WQ_UNBOUND pools are segregated across
NUMA boundaries. This used to be fine but I think it's likely outmoded now.
given that non-trivial cache hierarchies on top of UMA or inside a node are
a thing these days.

Looking at f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read
workqueue"), I feel a bit uneasy. This would be fine on a setup which does
moderate amount of IOs on CPUs with quick enough accelration mechanisms, but
that's not the whole world. Use cases that generate extreme amount of IOs do
depend on the ability to fan out IO related work items across multiple CPUs
especially if the IOs coincide with network activities. So, my intuition is
that the commit is fixing a subset of use cases while likely regressing
others.

If the cache theory is correct, the right thing to do would be making
workqueue init code a bit smarter so that it segements unbound pools on LLC
boundaries rather than NUMA, which would make more sense on recent AMD chips
too. Nathan, can you run `hwloc-ls` on the affected setup (or `lstopo
out.pdf`) and attach the output?

As for the overhead of supporting WQ_UNBOUND, it does add non-trivial amount
of complexity but of the boring kind. It's all managerial stuff which isn't
too difficult to understand and relatively easy to understand and fix when
something goes wrong, so it isn't expensive in terms of supportability and
it does address classes of significant use cases, so I think we should just
fix it.

Thanks.
Linus Torvalds March 21, 2023, 6:31 p.m. UTC | #5
On Mon, Mar 20, 2023 at 11:05 PM Tejun Heo <tj@kernel.org> wrote:
>
> Do you remember what the other case was? Was it also on heterogenous arm
> setup?

Yup. See commit c25da5b7baf1 ("dm verity: stop using WQ_UNBOUND for verify_wq")

But see also 3fffb589b9a6 ("erofs: add per-cpu threads for
decompression as an option").

And you can see the confusion this all has in commit 43fa47cb116d ("dm
verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND"), which
perhaps should be undone now.

> There aren't many differences between unbound workqueues and percpu ones
> that aren't concurrency managed. If there are significant performance
> differences, it's unlikely to be directly from whatever workqueue is doing.

There's a *lot* of special cases for WQ_UNBOUND in the workqueue code,
and they are a lot less targeted than the other WQ_xyz flags, I feel.
They have their own cpumask logic, special freeing rules etc etc.

So I would say that the "aren't many differences" is not exactly true.
There are subtle and random differences, including the very basic
"queue_work()" workflow.

Now, I assume that the arm cases don't actually use
wq_unbound_cpumask, so I assume it's mostly the "instead of local cpu
queue, use the local node queue", and so it's all on random CPU's
since nobody uses NUMA nodes.

And no, if it's caching effects, doing it on LLC boundaries isn't
rigth *either*. By default it should probably be on L2 boundaries or
something, with most non-NUMA setups likely having one single LLC but
multiple L2 nodes.

              Linus
Tejun Heo March 23, 2023, 1:04 a.m. UTC | #6
Hello, Linus.

On Tue, Mar 21, 2023 at 11:31:35AM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 11:05 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Do you remember what the other case was? Was it also on heterogenous arm
> > setup?
> 
> Yup. See commit c25da5b7baf1 ("dm verity: stop using WQ_UNBOUND for verify_wq")
> 
> But see also 3fffb589b9a6 ("erofs: add per-cpu threads for
> decompression as an option").
> 
> And you can see the confusion this all has in commit 43fa47cb116d ("dm
> verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND"), which
> perhaps should be undone now.

Thanks for the pointers. They all seem plausible symptoms of work items
getting bounced across slow cache boundaries. I'm off for a few weeks so
can't really dig in right now but will get to it afterwards.

> > There aren't many differences between unbound workqueues and percpu ones
> > that aren't concurrency managed. If there are significant performance
> > differences, it's unlikely to be directly from whatever workqueue is doing.
> 
> There's a *lot* of special cases for WQ_UNBOUND in the workqueue code,
> and they are a lot less targeted than the other WQ_xyz flags, I feel.
> They have their own cpumask logic, special freeing rules etc etc.
>
> So I would say that the "aren't many differences" is not exactly true.
> There are subtle and random differences, including the very basic
> "queue_work()" workflow.

Oh yeah, pwq management side is pretty involved, being dynamic and all. I
just couldn't think of anything in the issue & execution path which would
explain the reported significant performance penalty. The issue path
differences come down to node selection and dynamic pwq release handling,
neither of which should be in play in this case.

> Now, I assume that the arm cases don't actually use
> wq_unbound_cpumask, so I assume it's mostly the "instead of local cpu
> queue, use the local node queue", and so it's all on random CPU's
> since nobody uses NUMA nodes.
> 
> And no, if it's caching effects, doing it on LLC boundaries isn't
> rigth *either*. By default it should probably be on L2 boundaries or
> something, with most non-NUMA setups likely having one single LLC but
> multiple L2 nodes.

Hmm... on recent x86 cpus, that'd just end up paring up the hyperthreads,
which would likely be too narrow especially given that l3's on recent cpus
seem pretty fast. I think what I need to do is generalizing the numa logic
so that it can sit on any of these topological boundaries and let arch
define the default boundary and let each wq to override the selection.

Another related shortcoming is that while the unbound wq's say "let the
scheduler figure out the best solution within the boundary", they don't
communicate the locality of work item to the scheduler at all, so within
each boundary, from scheduler pov, the assignment is completely random. Down
the line, it'd probably help if wq can provide some hints re. the expected
locality.

Thanks.
Linus Torvalds March 23, 2023, 6:04 p.m. UTC | #7
On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@kernel.org> wrote:
>
> Thanks for the pointers. They all seem plausible symptoms of work items
> getting bounced across slow cache boundaries. I'm off for a few weeks so
> can't really dig in right now but will get to it afterwards.

So just as a gut feeling, I suspect that one solution would be to
always *start* the work on the local CPU (where "local" might be the
same, or at least a sibling).

The only reason to migrate to another CPU would be if the work is
CPU-intensive, and I do suspect that is commonly not really the case.

And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage,
and should just be gotten rid of, because what could be considered
"CPU intensive" in under one situation might not be CPU intensive in
another one, so trying to use some static knowledge about it is just
pure guess-work.

The different situations might be purely contextual things ("heavy
network traffic when NAPI polling kicks in"), but it might also be
purely hardware-related (ie "this is heavy if we don't have CPU hw
acceleration for crypto, but cheap if we do").

So I really don't think it should be some static decision, either
through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on
first available CPU".

Wouldn't it be much nicer if we just noticed it dynamically, and
WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another
CPU if it ends up being advantageous?

And we actually kind of have that dynamic flag already, in the form of
the scheduler. It might even be explicit in the context of the
workqueue (with "need_resched()" being true and the workqueue code
itself might notice it and explicitly then try to spread it out), but
with preemption it's more implicit and maybe it needs a bit of
tweaking help.

So that's what I mean by "start the work as local CPU work" - use that
as the baseline decision (since it's going to be the case that has
cache locality), and actively try to avoid spreading things out unless
we have an explicit reason to, and that reason we could just get from
the scheduler.

The worker code already has that "wq_worker_sleeping()" callback from
the scheduler, but that only triggers when a worker is going to sleep.
I'm saying that the "scheduler decided to schedule out a worker" case
might be used as a "Oh, this is CPU intensive, let's try to spread it
out".

See what I'm trying to say?

And yes, the WQ_UNBOUND case does have a weak "prefer local CPU" in
how it basically tends to try to pick the current CPU unless there is
some active reason not to (ie the whole "wq_select_unbound_cpu()"
code), but I suspect that is then counter-acted by the fact that it
will always pick the workqueue pool by node - so the "current CPU"
ends up probably being affected by what random CPU that pool was
running on.

An alternative to any scheduler interaction thing might be to just
tweak "first_idle_worker()". I get the feeling that that choice is
just horrid, and that is another area that could really try to take
locality into account. insert_work() realyl seems to pick a random
worker from the pool - which is fine when the pool is per-cpu, but
when it's the unbound "random node" pool, I really suspect that it
might be much better to try to pick a worker that is on the right cpu.

But hey, I may be missing something. You know this code much better than I do.

                  Linus
Nathan Huckleberry March 28, 2023, 9:36 p.m. UTC | #8
Hey all,
On Thu, Mar 23, 2023 at 11:04 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Thanks for the pointers. They all seem plausible symptoms of work items
> > getting bounced across slow cache boundaries. I'm off for a few weeks so
> > can't really dig in right now but will get to it afterwards.
>
> So just as a gut feeling, I suspect that one solution would be to
> always *start* the work on the local CPU (where "local" might be the
> same, or at least a sibling).
>
> The only reason to migrate to another CPU would be if the work is
> CPU-intensive, and I do suspect that is commonly not really the case.
>
> And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage,
> and should just be gotten rid of, because what could be considered
> "CPU intensive" in under one situation might not be CPU intensive in
> another one, so trying to use some static knowledge about it is just
> pure guess-work.
>
> The different situations might be purely contextual things ("heavy
> network traffic when NAPI polling kicks in"), but it might also be
> purely hardware-related (ie "this is heavy if we don't have CPU hw
> acceleration for crypto, but cheap if we do").
>
> So I really don't think it should be some static decision, either
> through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on
> first available CPU".

I agree that these flags are prone to misuse. In most cases, there's
no explanation for why the flags are being used. Either the flags were
enabled unintentionally or the author never posted a performance
justification.

Imo figuring out which set of flags to set on which architecture is
too much of a burden for each workqueue user.

>
> Wouldn't it be much nicer if we just noticed it dynamically, and
> WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another
> CPU if it ends up being advantageous?
>
> And we actually kind of have that dynamic flag already, in the form of
> the scheduler. It might even be explicit in the context of the
> workqueue (with "need_resched()" being true and the workqueue code
> itself might notice it and explicitly then try to spread it out), but
> with preemption it's more implicit and maybe it needs a bit of
> tweaking help.
>
> So that's what I mean by "start the work as local CPU work" - use that
> as the baseline decision (since it's going to be the case that has
> cache locality), and actively try to avoid spreading things out unless
> we have an explicit reason to, and that reason we could just get from
> the scheduler.

This would work for the use cases I'm worried about. Most of the work
items used for IO post-processing are really fast. I suspect that the
interaction between frequency scaling and WQ_UNBOUND is causing the
slowdowns.

>
> The worker code already has that "wq_worker_sleeping()" callback from
> the scheduler, but that only triggers when a worker is going to sleep.
> I'm saying that the "scheduler decided to schedule out a worker" case
> might be used as a "Oh, this is CPU intensive, let's try to spread it
> out".
>
> See what I'm trying to say?
>
> And yes, the WQ_UNBOUND case does have a weak "prefer local CPU" in
> how it basically tends to try to pick the current CPU unless there is
> some active reason not to (ie the whole "wq_select_unbound_cpu()"
> code), but I suspect that is then counter-acted by the fact that it
> will always pick the workqueue pool by node - so the "current CPU"
> ends up probably being affected by what random CPU that pool was
> running on.
>
> An alternative to any scheduler interaction thing might be to just
> tweak "first_idle_worker()". I get the feeling that that choice is
> just horrid, and that is another area that could really try to take
> locality into account. insert_work() realyl seems to pick a random
> worker from the pool - which is fine when the pool is per-cpu, but
> when it's the unbound "random node" pool, I really suspect that it
> might be much better to try to pick a worker that is on the right cpu.
>
> But hey, I may be missing something. You know this code much better than I do.
>
>                   Linus

Thanks,
Huck
Tejun Heo April 13, 2023, 12:25 a.m. UTC | #9
Hello, Linus.

Okay, I'm now back online.

On Thu, Mar 23, 2023 at 11:04:25AM -0700, Linus Torvalds wrote:
> On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Thanks for the pointers. They all seem plausible symptoms of work items
> > getting bounced across slow cache boundaries. I'm off for a few weeks so
> > can't really dig in right now but will get to it afterwards.
> 
> So just as a gut feeling, I suspect that one solution would be to
> always *start* the work on the local CPU (where "local" might be the
> same, or at least a sibling).

Yeah, that seems like the sanest way to leverage the scheduler. The only
complication is around tracking which workers were on which CPUs and how
sticky the cpu association should be (e.g. we don't want to unnecessarily
jump workers across CPUs but we probably don't want to maintain strict
per-cpu worker pools either). I'll try to come up with a reasonable
trade-off which isn't too complicated.

> The only reason to migrate to another CPU would be if the work is
> CPU-intensive, and I do suspect that is commonly not really the case.
> 
> And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage,
> and should just be gotten rid of, because what could be considered
> "CPU intensive" in under one situation might not be CPU intensive in
> another one, so trying to use some static knowledge about it is just
> pure guess-work.
> 
> The different situations might be purely contextual things ("heavy
> network traffic when NAPI polling kicks in"), but it might also be
> purely hardware-related (ie "this is heavy if we don't have CPU hw
> acceleration for crypto, but cheap if we do").
> 
> So I really don't think it should be some static decision, either
> through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on
> first available CPU".
> 
> Wouldn't it be much nicer if we just noticed it dynamically, and
> WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another
> CPU if it ends up being advantageous?
> 
> And we actually kind of have that dynamic flag already, in the form of
> the scheduler. It might even be explicit in the context of the
> workqueue (with "need_resched()" being true and the workqueue code
> itself might notice it and explicitly then try to spread it out), but
> with preemption it's more implicit and maybe it needs a bit of
> tweaking help.

Yeah, CPU_INTENSIVE was added as an easy (to implement) way out for cpu
hogging percpu work items. Given that percpu workers track the scheduling
events anyway whether from preemption or explicit schedule(), it should be
possible to remove it while maintaining most of the benefits of worker
concurrency management. Because the scheduler isn't aware of work item
boundaries, workqueue can't blindly use scheduling events but that's easy to
resolve with an extra timestamp.

I'll think more about whether it'd be a good idea to subject unbound workers
to concurrency management before it gets spread out so that the only
distinction between percpu and unbound is whether the work item can be
booted off cpu when they run for too long while being subject to the same
concurrency control before that point.

> So that's what I mean by "start the work as local CPU work" - use that
> as the baseline decision (since it's going to be the case that has
> cache locality), and actively try to avoid spreading things out unless
> we have an explicit reason to, and that reason we could just get from
> the scheduler.
> 
> The worker code already has that "wq_worker_sleeping()" callback from
> the scheduler, but that only triggers when a worker is going to sleep.
> I'm saying that the "scheduler decided to schedule out a worker" case
> might be used as a "Oh, this is CPU intensive, let's try to spread it
> out".
> 
> See what I'm trying to say?

Yeah, lemme look into it. It'd be great to simplify workqueue usage and
actually make it leverage what the scheduler knows about what should run
where.

Thanks.