
mm: Always sanity check anon_vma first for per-vma locks

Message ID 20240410170621.2011171-1-peterx@redhat.com (mailing list archive)
State New
Series mm: Always sanity check anon_vma first for per-vma locks

Commit Message

Peter Xu April 10, 2024, 5:06 p.m. UTC
anon_vma is a tricky object in the context of per-vma lock, because it's
racy to modify it in that context and mmap lock is needed if it's not
stable yet.

So far there are three places that sanity check anon_vma for that:

  - lock_vma_under_rcu(): this is the major entrance of the per-vma lock,
    where we have taken care of anon memory vs. potential anon_vma
    allocations.

  - lock_vma(): even though it looks like a generic API, it's only used in
    the userfaultfd context to leverage per-vma locks.  It does an extra
    check on MAP_PRIVATE file mappings for the same anon_vma issue.

  - vmf_anon_prepare(): it works for private file mapping faults just like
    what lock_vma() wanted to cover above.  One trivial difference is that
    in some extreme corner cases the fault handler will still allow the
    per-vma fault to happen, like a READ on a privately mapped file.

The question is whether it was intended to be this complicated.  Per my
question in the thread, it was not intended, and Suren also seems to agree [1].

So the trivial side effects of such a patch are:

  - We may do slightly better on the first WRITE of a private file mapping,
  because we can retry earlier (in lock_vma_under_rcu(), rather than later
  in vmf_anon_prepare()).

  - We may always use the mmap lock for the initial READs on private file
  mappings, while before this patch it _can_ (only when no WRITE ever
  happened... which doesn't make much sense for MAP_PRIVATE..) do the
  read fault with the per-vma lock.

Note that right after any WRITE the anon_vma will be stabilized, and then
there will be no difference.  I believe that should also be the majority
of cases; I tried running a program writing to MAP_PRIVATE file memory
(which I pre-heated in the page cache) and I could hardly measure a
difference in performance.

Let's simply ignore all those trivial corner cases and unify the anon_vma
check from three places into one.  I also didn't check the remaining users
of lock_vma_under_rcu(); in a !fault context this could even fix something
that used to race with private file mappings, but I didn't check further.
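
(The patch body is not reproduced in this archive; as a rough sketch only,
the unified early check in lock_vma_under_rcu() would look much like the
private-mapping test that lock_vma() already does:)

        /*
         * Sketch: treat any MAP_PRIVATE mapping (anon or file) with no
         * anon_vma yet as not yet stable and fall back to the mmap lock
         * up front, instead of deferring the file case to
         * vmf_anon_prepare().
         */
        if (unlikely(!(vma->vm_flags & VM_SHARED) && !vma->anon_vma))
                goto inval_end_read;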

I still left a WARN_ON_ONCE() in vmf_anon_prepare() to double check we're
all good.

[1] https://lore.kernel.org/r/CAJuCfpGj5xk-NxSwW6Mt8NGZcV9N__8zVPMGXDPAYKMcN9=Oig@mail.gmail.com

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c      | 10 ++++------
 mm/userfaultfd.c | 13 ++-----------
 2 files changed, 6 insertions(+), 17 deletions(-)

Comments

Matthew Wilcox April 10, 2024, 8:26 p.m. UTC | #1
On Wed, Apr 10, 2024 at 01:06:21PM -0400, Peter Xu wrote:
> anon_vma is a tricky object in the context of per-vma lock, because it's
> racy to modify it in that context and mmap lock is needed if it's not
> stable yet.

I object to this commit message.  First, it's not a "sanity check".  It's
a check to see if we already have an anon VMA.  Second, it's not "racy
to modify it" at all.  The problem is that we need to look at other
VMAs, for which we do not hold the lock.

> So the trivial side effect of such patch is:
> 
>   - We may do slightly better on the first WRITE of a private file mapping,
>   because we can retry earlier (in lock_vma_under_rcu(), rather than
>   vmf_anon_prepare() later).
> 
>   - We may always use mmap lock for the initial READs on a private file
>   mappings, while before this patch it _can_ (only when no WRITE ever
>   happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
>   read fault with per-vma lock.

But that's a super common path!  Look at 'cat /proc/self/maps'.  All
your program text (including libraries) is mapped PRIVATE, and never
written to (except by ptrace, I guess).

NAK this patch.
Matthew Wilcox April 10, 2024, 9:10 p.m. UTC | #2
On Wed, Apr 10, 2024 at 04:43:51PM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 09:26:45PM +0100, Matthew Wilcox wrote:
> > On Wed, Apr 10, 2024 at 01:06:21PM -0400, Peter Xu wrote:
> > > anon_vma is a tricky object in the context of per-vma lock, because it's
> > > racy to modify it in that context and mmap lock is needed if it's not
> > > stable yet.
> > 
> > I object to this commit message.  First, it's not a "sanity check".  It's
> > a check to see if we already have an anon VMA.  Second, it's not "racy
> > to modify it" at all.  The problem is that we need to look at other
> > VMAs, for which we do not hold the lock.
> 
> For that "do not hold locks" part, isn't that "racy"?

No.

> > >   - We may always use mmap lock for the initial READs on a private file
> > >   mappings, while before this patch it _can_ (only when no WRITE ever
> > >   happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> > >   read fault with per-vma lock.
> > 
> > But that's a super common path!  Look at 'cat /proc/self/maps'.  All
> > your program text (including libraries) is mapped PRIVATE, and never
> > written to (except by ptrace, I guess).
> > 
> > NAK this patch.
> 
> We're talking about any vma that will first benefit from a per-vma lock
> here, right?
> 
> I think it should be only relevant to some major VMA or bunch of VMAs that
> an userspace maps explicitly, then iiuc the goal is we want to reduce the
> cache bouncing of the lock when it used to be per-mm, by replacing it with
> a finer lock.  It doesn't sound right that these libraries even fall into
> this category as they should just get loaded soon enough when the program
> starts.
> 
> IOW, my understanding is that per-vma lock doesn't benefit from such normal
> vmas or simple programs that much; we take either per-vma read lock, or
> mmap read lock, and I would expect similar performance when such cache
> bouncing isn't heavy.
> 
> I can do some tests later today or tomorrow. Any suggestion you have on
> amplifying such effect that you have concern with?

8 socket NUMA system, 800MB text segment, 10,000 threads.  No, I'm not
joking, that's a real customer workload.
Peter Xu April 10, 2024, 9:23 p.m. UTC | #3
On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > I can do some tests later today or tomorrow. Any suggestion you have on
> > amplifying such effect that you have concern with?
> 
> 8 socket NUMA system, 800MB text segment, 10,000 threads.  No, I'm not
> joking, that's a real customer workload.

Well, I believe you, but even with this, that's a total of 800MB of memory
on a giant monster system... probably just to fault in once.

And even before we get into the details.. we're talking about such a giant
program running across hundreds of cores with hundreds of MB of text;
then... hasn't the program developer already considered mlockall() at the
entry of the program?  Wouldn't that already be greatly beneficial with
whatever granule of locks a future fault would take?
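
(For reference, the kind of call being suggested is simply something along
these lines at program startup; a generic example, not taken from any
particular workload:)

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                /* Pre-fault and pin everything mapped now and in the future. */
                if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                        perror("mlockall");

                /* ... the actual workload ... */
                return 0;
        }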
Matthew Wilcox April 10, 2024, 11:59 p.m. UTC | #4
On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > amplifying such effect that you have concern with?
> > 
> > 8 socket NUMA system, 800MB text segment, 10,000 threads.  No, I'm not
> > joking, that's a real customer workload.
> 
> Well, I believe you, but even with this, that's a total of 800MB memory on
> a giant moster system... probably just to fault in once.
> 
> And even before we talk about that into details.. we're talking about such
> giant program running acorss hundreds of cores with hundreds of MB text,
> then... hasn't the program developer already considered mlockall() at the
> entry of the program?  Wouldn't that greatly beneficial already with
> whatever granule of locks that a future fault would take?

I don't care what your theory is, or even what your benchmarking shows.
I had basically the inverse of this patch, and my customer's workload
showed significant improvement as a result.  Data talks, bullshit walks.
Your patch is NAKed and will remain NAKed.
Peter Xu April 11, 2024, 12:20 a.m. UTC | #5
On Thu, Apr 11, 2024 at 12:59:09AM +0100, Matthew Wilcox wrote:
> On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> > On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > > amplifying such effect that you have concern with?
> > > 
> > > 8 socket NUMA system, 800MB text segment, 10,000 threads.  No, I'm not
> > > joking, that's a real customer workload.
> > 
> > Well, I believe you, but even with this, that's a total of 800MB memory on
> > a giant moster system... probably just to fault in once.
> > 
> > And even before we talk about that into details.. we're talking about such
> > giant program running acorss hundreds of cores with hundreds of MB text,
> > then... hasn't the program developer already considered mlockall() at the
> > entry of the program?  Wouldn't that greatly beneficial already with
> > whatever granule of locks that a future fault would take?
> 
> I don't care what your theory is, or even what your benchmarking shows.
> I had basically the inverse of this patch, and my customer's workload
> showed significant improvement as a result.  Data talks, bullshit walks.
> Your patch is NAKed and will remain NAKed.

Either tell me your workload, and I may try it.

Or please explain why it helps.  If such a huge library is in a single VMA,
I don't see why the per-vma lock is better than the mmap lock.  If the text
is split across multiple vmas, it should only help when the cores fault at
least on different vmas, not the same one.

Would you please do one or the other?

For now my patch got a strong NAK without real proof yet.  If that proof
really exists, I am happy to learn, and I agree this patch shouldn't go in
if it is provided.  Otherwise I am not convinced.  If you think data talks,
I'm happy to try any workload that I have access to, and then we can
compare data.
Matthew Wilcox April 11, 2024, 2:50 p.m. UTC | #6
On Wed, Apr 10, 2024 at 08:20:04PM -0400, Peter Xu wrote:
> On Thu, Apr 11, 2024 at 12:59:09AM +0100, Matthew Wilcox wrote:
> > On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> > > On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > > > amplifying such effect that you have concern with?
> > > > 
> > > > 8 socket NUMA system, 800MB text segment, 10,000 threads.  No, I'm not
> > > > joking, that's a real customer workload.
> > > 
> > > Well, I believe you, but even with this, that's a total of 800MB memory on
> > > a giant moster system... probably just to fault in once.
> > > 
> > > And even before we talk about that into details.. we're talking about such
> > > giant program running acorss hundreds of cores with hundreds of MB text,
> > > then... hasn't the program developer already considered mlockall() at the
> > > entry of the program?  Wouldn't that greatly beneficial already with
> > > whatever granule of locks that a future fault would take?
> > 
> > I don't care what your theory is, or even what your benchmarking shows.
> > I had basically the inverse of this patch, and my customer's workload
> > showed significant improvement as a result.  Data talks, bullshit walks.
> > Your patch is NAKed and will remain NAKed.
> 
> Either would you tell me your workload, I may try it.
> 
> Or, please explain why it helps?  If such huge library is in a single VMA,
> I don't see why per-vma lock is better than mmap lock.  If the text is
> combined with multiple vmas, it should only help when each core faults at
> least on different vmas, not the same.

Oh, you really don't understand.  The mmap_lock is catastrophically
overloaded.  Before the per-VMA lock, every page fault took it for read,
and every call to mmap() took it for write.  Because our rwsems are
fair, once one thread has called mmap() it waits for all existing page
faults to complete _and_ blocks all page faults from starting until
it has completed.  That's a huge source of unexpected latency for any
multithreaded application.

Anything we can do to avoid taking the mmap_sem, even for read, helps any
multithreaded workload.  Your suggestion that "this is rare, it doesn't
matter" shows that you don't get it.  That you haven't found a workload
where you can measure it shows that your testing is inadequate.

Yes, there's added complexity with the per-VMA locks.  But we need it for
good performance.  Throwing away performance on a very small reduction
in complexity is a terrible trade-off.
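
(To make the trade-off concrete: the pattern being argued about is roughly
the one the arch fault handlers use, sketched below in simplified form; a
fault that cannot be handled under the per-VMA lock returns VM_FAULT_RETRY
and is retried under mmap_read_lock(), which is exactly the fallback whose
frequency is in question.)

        /* Simplified sketch of the per-VMA-lock fast path in a fault handler. */
        vma = lock_vma_under_rcu(mm, address);
        if (!vma)
                goto lock_mmap;         /* fall back to mmap_read_lock() */

        fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
        if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                vma_end_read(vma);
        if (!(fault & VM_FAULT_RETRY))
                goto done;              /* handled without touching mmap_lock */

lock_mmap:
        /* Retry the whole fault under the (possibly contended) mmap_lock. */
        mmap_read_lock(mm);
        ...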
Peter Xu April 11, 2024, 3:34 p.m. UTC | #7
On Thu, Apr 11, 2024 at 03:50:54PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 10, 2024 at 08:20:04PM -0400, Peter Xu wrote:
> > On Thu, Apr 11, 2024 at 12:59:09AM +0100, Matthew Wilcox wrote:
> > > On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> > > > On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > > > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > > > > amplifying such effect that you have concern with?
> > > > > 
> > > > > 8 socket NUMA system, 800MB text segment, 10,000 threads.  No, I'm not
> > > > > joking, that's a real customer workload.
> > > > 
> > > > Well, I believe you, but even with this, that's a total of 800MB memory on
> > > > a giant moster system... probably just to fault in once.
> > > > 
> > > > And even before we talk about that into details.. we're talking about such
> > > > giant program running acorss hundreds of cores with hundreds of MB text,
> > > > then... hasn't the program developer already considered mlockall() at the
> > > > entry of the program?  Wouldn't that greatly beneficial already with
> > > > whatever granule of locks that a future fault would take?
> > > 
> > > I don't care what your theory is, or even what your benchmarking shows.
> > > I had basically the inverse of this patch, and my customer's workload
> > > showed significant improvement as a result.  Data talks, bullshit walks.
> > > Your patch is NAKed and will remain NAKed.
> > 
> > Either would you tell me your workload, I may try it.
> > 
> > Or, please explain why it helps?  If such huge library is in a single VMA,
> > I don't see why per-vma lock is better than mmap lock.  If the text is
> > combined with multiple vmas, it should only help when each core faults at
> > least on different vmas, not the same.
> 
> Oh, you really don't understand.  The mmap_lock is catastrophically
> overloaded.  Before the per-VMA lock, every page fault took it for read,
> and every call to mmap() took it for write.  Because our rwsems are
> fair, once one thread has called mmap() it waits for all existing page
> faults to complete _and_ blocks all page faults from starting until
> it has completed.  That's a huge source of unexpected latency for any
> multithreaded application.
> 
> Anything we can do to avoid taking the mmap_sem, even for read, helps any
> multithreaded workload.  Your suggestion that "this is rare, it doesn't
> matter" shows that you don't get it.  That you haven't found a workload
> where you can measure it shows that your testing is inadequate.
> 
> Yes, there's added complexity with the per-VMA locks.  But we need it for
> good performance.  Throwing away performance on a very small reduction
> in complexity is a terrible trade-off.

Yes, this is a technical discussion, and such comments are what I'm looking
for.  Thank you.

What I am still not sure about is whether the performance degradation you
worry about for "this small corner case" exists at all.

Let's first check when those vmf_anon_prepare() lines were introduced:

commit 17c05f18e54158a3eed0c22c85b7a756b63dcc01
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Mon Feb 27 09:36:25 2023 -0800

    mm: prevent do_swap_page from handling page faults under VMA lock

It doesn't look like the late anon_vma check was done there for any
performance reason.

To figure these out, let me ask some more questions.

1) When you said "you ran a customer workload, and that greatly improved
   performance", are you comparing between:

   - with/without file-typed per-vma lock, or,
   - with/without this patch?

   Note that I hope I'm not touching the fact that file per-vma locks
   should work like before for the majority of cases.  And I'm surprised to
   see your comment, because I didn't expect this had even been measured
   before.

   To ask another way: was it your intention to check anon_vma late for
   private file mappings when working on file support for per-vma locks?

   If the answer is yes, I'd say please provide a documentation patch to
   support such behavior.  You can stop my patch from getting merged now,
   but it's never clear whether someone else will see this and post it
   again.  If it wasn't obvious to Suren, who introduced the per-vma lock
   [1], then I won't be surprised if it's unknown to most developers on
   the list.

2) What happens when the usage of lock_vma_under_rcu() keeps spreading?

   It has already spread to the uffd world.  Uffd has that special check in
   lock_vma() to make sure private file mappings are fine.

   If it keeps going like that, I won't be surprised to see future users of
   lock_vma_under_rcu() forget about private file mappings, and that can
   cause hard-to-debug issues.  Even if lock_vma_under_rcu() is exactly
   what you're looking for in the fault path, I think we may need a
   lock_vma_under_rcu_fault(), and then lock_vma_under_rcu() should cover
   private file mappings so that future users aren't easily exposed to
   hard-to-debug issues (rough sketch below).
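
   (A hypothetical sketch of that split; the _fault() name and the exact
   shape are invented here purely for illustration:)

        /* Generic helper: conservative for any private mapping. */
        struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
                                                  unsigned long address)
        {
                struct vm_area_struct *vma;

                /* Relaxed lookup; fault handlers would call this directly. */
                vma = lock_vma_under_rcu_fault(mm, address);
                if (vma && !(vma->vm_flags & VM_SHARED) &&
                    unlikely(!vma->anon_vma)) {
                        vma_end_read(vma);
                        return NULL;
                }
                return vma;
        }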

[1] https://lore.kernel.org/r/CAJuCfpGj5xk-NxSwW6Mt8NGZcV9N__8zVPMGXDPAYKMcN9=Oig@mail.gmail.com

Thanks,
Suren Baghdasaryan April 11, 2024, 3:42 p.m. UTC | #8
On Wed, Apr 10, 2024 at 1:26 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Apr 10, 2024 at 01:06:21PM -0400, Peter Xu wrote:
> > anon_vma is a tricky object in the context of per-vma lock, because it's
> > racy to modify it in that context and mmap lock is needed if it's not
> > stable yet.
>
> I object to this commit message.  First, it's not a "sanity check".  It's
> a check to see if we already have an anon VMA.  Second, it's not "racy
> to modify it" at all.  The problem is that we need to look at other
> VMAs, for which we do not hold the lock.
>
> > So the trivial side effect of such patch is:
> >
> >   - We may do slightly better on the first WRITE of a private file mapping,
> >   because we can retry earlier (in lock_vma_under_rcu(), rather than
> >   vmf_anon_prepare() later).
> >
> >   - We may always use mmap lock for the initial READs on a private file
> >   mappings, while before this patch it _can_ (only when no WRITE ever
> >   happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> >   read fault with per-vma lock.
>
> But that's a super common path!  Look at 'cat /proc/self/maps'.  All
> your program text (including libraries) is mapped PRIVATE, and never
> written to (except by ptrace, I guess).

Uh, indeed I didn't realize this would be the side effect of this early
check. And that's exactly why I wanted Matthew's input on this in [1].

>
> NAK this patch.
>
Liam R. Howlett April 11, 2024, 5:13 p.m. UTC | #9
* Peter Xu <peterx@redhat.com> [240410 13:06]:
> anon_vma is a tricky object in the context of per-vma lock, because it's
> racy to modify it in that context and mmap lock is needed if it's not
> stable yet.

How is it racy in your view?  There isn't a way to get in a situation
where its state isn't certain, is there?

> 
> So far there are three places that sanity checks anon_vma for that:
> 
>   - lock_vma_under_rcu(): this is the major entrance of per-vma lock, where
>     we have taken care of anon memory v.s. potential anon_vma allocations.

Well, not exactly.  It's single-vma faults vs potential modifications
beyond one vma (including uses of adjacent vmas with anon_vmas).

> 
>   - lock_vma(): even if it looks so generic as an API, it's only used in
>     userfaultfd context to leverage per-vma locks.  It does extra check
>     over MAP_PRIVATE file mappings for the same anon_vma issue.

This static function is in mm/userfaultfd so, yes, it's not generic -
the name is generic, but I didn't see a reason to hold up the patch for
a static name that is descriptive.  It's static so I'm not concerned
about usage growing.

I would expect such a check will eventually need to be moved to common
code, with the name then potentially changed to something more
descriptive.  This seems premature with a single user though.

> 
>   - vmf_anon_prepare(): it works for private file mapping faults just like
>     what lock_vma() wanted to cover above.  One trivial difference is in
>     some extremely corner case, the fault handler will still allow per-vma
>     fault to happen, like a READ on a privately mapped file.

What is happening here is that we are detecting when the virtual memory
space is stable vs when the vma is stable.  In some cases, when we need
to check prev/next, we need to make sure the virtual memory space is
stable.  In other cases, the vma is all that matters to the operation,
and so we can continue without stopping anyone else from modifying the
virtual memory space - that is, we are allowing writes in other areas.

> 
> The question is whether that's intended to make it as complicated.  Per my
> question in the thread, it is not intended, and Suren also seems to agree [1].
> 
> So the trivial side effect of such patch is:
> 
>   - We may do slightly better on the first WRITE of a private file mapping,
>   because we can retry earlier (in lock_vma_under_rcu(), rather than
>   vmf_anon_prepare() later).
> 
>   - We may always use mmap lock for the initial READs on a private file
>   mappings, while before this patch it _can_ (only when no WRITE ever
>   happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
>   read fault with per-vma lock.
> 
> Then noted that right after any WRITE the anon_vma will be stablized, then
> there will be no difference.

In my view, the anon_vma is always stable.  During a write it is
initialised.

>And I believe that should be the majority
> cases too; I also did try to run a program, writting to MAP_PRIVATE file
> memory (that I pre-headed in the page cache) and I can hardly measure a
> difference in performance.
> 
> Let's simply ignore all those trivial corner cases and unify the anon_vma
> check from three places into one.  I also didn't check the rest users of
> lock_vma_under_rcu(), where in a !fault context it could even fix something
> that used to race with private file mappings but I didn't check further.

This change will increase mmap semaphore contention.  I'd like to move
to a more common structured layout, but I think we need to find a way to
do this without failing lock_vma_under_rcu() more frequently.  In fact,
we need to be looking for ways to make it fail less.

The UFFD code is more strict on what is acceptable for a per-vma lock
[1].  This user has a restriction that will decrease the benefits of the
per-vma lock, but we shouldn't make this case the common one.

As I'm sure you are aware, the page fault path is the most common path
for using per-vma locks and should minimize taking the mmap lock as much
as possible.

I don't think we should increase the lock contention to simplify the
code.

Thanks,
Liam

[1] https://lore.kernel.org/lkml/CAG48ez0AdTijvuh0xueg_spwNE9tVcPuvqT9WpvmtiNNudQFMw@mail.gmail.com/
Matthew Wilcox April 11, 2024, 5:14 p.m. UTC | #10
On Thu, Apr 11, 2024 at 11:34:44AM -0400, Peter Xu wrote:
>    If the answer is yes, I'd say please provide some document patch to
>    support such behavior, you can stop my patch from getting merged now,
>    but it's never clear whether someone else will see this and post it
>    again.  If it wasn't obviously to Suren who introduced per-vma lock [1],
>    then I won't be surprised it's unknown to most developers on the list.

Anyone touching this path should have a good idea about what is and is
not the common case.  Your confidence greatly exceeded your ability
here.  I will not be submitting such a patch.
Matthew Wilcox April 11, 2024, 9:27 p.m. UTC | #11
On Thu, Apr 11, 2024 at 05:12:02PM -0400, Peter Xu wrote:
> The question is whether that's intended to make it as complicated.  For
> example, why don't we check anon_vma for anonymous too later when prepare
> anon_vma, however we do it late for file memory.  AFAICT there's nothing
> special with file memory in this case.

Yes, it's absolutely intended.  If anything, anon memory is the special
case that checks up-front.

Congratulations on adding additional instructions to the common case.
I don't understand why you persist with your nonsense.  Please stop.
Peter Xu April 11, 2024, 9:46 p.m. UTC | #12
On Thu, Apr 11, 2024 at 10:27:56PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 11, 2024 at 05:12:02PM -0400, Peter Xu wrote:
> > The question is whether that's intended to make it as complicated.  For
> > example, why don't we check anon_vma for anonymous too later when prepare
> > anon_vma, however we do it late for file memory.  AFAICT there's nothing
> > special with file memory in this case.
> 
> Yes, it's absolutely intended.  If anything, anon memory is the special
> case that checks up-front.
> 
> Congratulations on adding additional instructions to the common case.
> I don't understand why you persist with your nonsense.  Please stop.

How many instructions does it take for a late RETRY for WRITEs to private
file mappings, falling back to mmap_sem?

Did you even finish reading the patch at all?
Matthew Wilcox April 11, 2024, 10:02 p.m. UTC | #13
On Thu, Apr 11, 2024 at 05:46:45PM -0400, Peter Xu wrote:
> On Thu, Apr 11, 2024 at 10:27:56PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 11, 2024 at 05:12:02PM -0400, Peter Xu wrote:
> > > The question is whether that's intended to make it as complicated.  For
> > > example, why don't we check anon_vma for anonymous too later when prepare
> > > anon_vma, however we do it late for file memory.  AFAICT there's nothing
> > > special with file memory in this case.
> > 
> > Yes, it's absolutely intended.  If anything, anon memory is the special
> > case that checks up-front.
> > 
> > Congratulations on adding additional instructions to the common case.
> > I don't understand why you persist with your nonsense.  Please stop.
> 
> How many instructions it takes for a late RETRY for WRITEs to private file
> mappings, fallback to mmap_sem?

Doesn't matter.  That happens _once_ per VMA, and it's dwarfed by the
cost of allocating and initialising the COWed page.  You're adding
instructions to every single page fault.  I'm not happy that we had to
add extra instructions to the fault path for single-threaded programs,
but we at least had the justification that we were improving scalability
on large systems.  Your excuse is "it makes the code cleaner".  And
honestly, I don't think it even does that.

> Did you even finish reading the patch at all?

Yes, I read the whole thing.  It's garbage.
Matthew Wilcox April 12, 2024, 3:14 a.m. UTC | #14
On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > How many instructions it takes for a late RETRY for WRITEs to private file
> > mappings, fallback to mmap_sem?
> 
> Doesn't matter.  That happens _once_ per VMA, and it's dwarfed by the
> cost of allocating and initialising the COWed page.  You're adding
> instructions to every single page fault.  I'm not happy that we had to
> add extra instructions to the fault path for single-threaded programs,
> but we at least had the justification that we were improving scalability
> on large systems.  Your excuse is "it makes the code cleaner".  And
> honestly, I don't think it even does that.

Suren, what would you think to this?

diff --git a/mm/memory.c b/mm/memory.c
index 6e2fe960473d..e495adcbe968 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
        if (!vma_start_read(vma))
                goto inval;

-       /*
-        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
-        * This check must happen after vma_start_read(); otherwise, a
-        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
-        * from its anon_vma.
-        */
-       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
-               goto inval_end_read;
-
        /* Check since vm_start/vm_end might change before we lock the VMA */
        if (unlikely(address < vma->vm_start || address >= vma->vm_end))
                goto inval_end_read;

That takes a few insns out of the page fault path (good!) at the cost
of one extra trip around the fault handler for the first fault on an
anon vma.  It makes the file & anon paths more similar to each other
(good!)

We'd need some data to be sure it's really a win, but less code is
always good.

We could even eagerly initialise vma->anon_vma for anon vmas.  I don't
know why we don't do that.
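
(Eager initialisation would presumably be a one-liner at VMA creation
time, along these lines; purely a hypothetical sketch, and Suren explains
below why it is deliberately not done:)

        /* Hypothetical: set up anon_vma as soon as an anonymous VMA exists. */
        if (vma_is_anonymous(vma) && unlikely(anon_vma_prepare(vma)))
                return -ENOMEM;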
Peter Xu April 12, 2024, 12:38 p.m. UTC | #15
On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > > How many instructions it takes for a late RETRY for WRITEs to private file
> > > mappings, fallback to mmap_sem?
> > 
> > Doesn't matter.  That happens _once_ per VMA, and it's dwarfed by the
> > cost of allocating and initialising the COWed page.  You're adding
> > instructions to every single page fault.  I'm not happy that we had to
> > add extra instructions to the fault path for single-threaded programs,
> > but we at least had the justification that we were improving scalability
> > on large systems.  Your excuse is "it makes the code cleaner".  And
> > honestly, I don't think it even does that.
> 
> Suren, what would you think to this?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..e495adcbe968 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma_start_read(vma))
>                 goto inval;
> 
> -       /*
> -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> -        * This check must happen after vma_start_read(); otherwise, a
> -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> -        * from its anon_vma.
> -        */
> -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> -               goto inval_end_read;
> -
>         /* Check since vm_start/vm_end might change before we lock the VMA */
>         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>                 goto inval_end_read;
> 
> That takes a few insns out of the page fault path (good!) at the cost
> of one extra trip around the fault handler for the first fault on an
> anon vma.  It makes the file & anon paths more similar to each other
> (good!)
> 
> We'd need some data to be sure it's really a win, but less code is
> always good.

You at least need two things:

  (1) don't throw away Jann's comment so easily

  (2) have a look at whether anon memory has the fallback at all

Maybe someone could already comment on this one in a harsh way, but no,
I'm not going to be like that.

I still don't understand why you dislike so much the idea of not falling
back at all when we can; the flags I checked should all be in hot cache
anyway, I think.

And since I've had enough of how you commented in your previous replies,
I'll leave the rest of the comments to others.
Suren Baghdasaryan April 12, 2024, 12:46 p.m. UTC | #16
On Thu, Apr 11, 2024 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > > How many instructions it takes for a late RETRY for WRITEs to private file
> > > mappings, fallback to mmap_sem?
> >
> > Doesn't matter.  That happens _once_ per VMA, and it's dwarfed by the
> > cost of allocating and initialising the COWed page.  You're adding
> > instructions to every single page fault.  I'm not happy that we had to
> > add extra instructions to the fault path for single-threaded programs,
> > but we at least had the justification that we were improving scalability
> > on large systems.  Your excuse is "it makes the code cleaner".  And
> > honestly, I don't think it even does that.
>
> Suren, what would you think to this?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..e495adcbe968 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma_start_read(vma))
>                 goto inval;
>
> -       /*
> -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> -        * This check must happen after vma_start_read(); otherwise, a
> -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> -        * from its anon_vma.
> -        */
> -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> -               goto inval_end_read;
> -
>         /* Check since vm_start/vm_end might change before we lock the VMA */
>         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>                 goto inval_end_read;
>
> That takes a few insns out of the page fault path (good!) at the cost
> of one extra trip around the fault handler for the first fault on an
> anon vma.  It makes the file & anon paths more similar to each other
> (good!)

I see what you mean. The impact would depend on the workload, but in my
earlier tests when developing per-VMA locks, on average fewer than 1% of
faults were for anonymous pages with vma->anon_vma==NULL. I recorded that
after using my desktop for a day or so and running a series of benchmark
tests. Again, that number might be drastically different on some other
workloads.

About the code, I'll take a closer look once I'm back from vacation
this weekend but I think you will also have to modify
do_anonymous_page() to use vmf_anon_prepare() instead of
anon_vma_prepare().

>
> We'd need some data to be sure it's really a win, but less code is
> always good.
>
> We could even eagerly initialise vma->anon_vma for anon vmas.  I don't
> know why we don't do that.

You found the answer to that question a long time ago and IIRC it was
because in many cases we end up not needing to set vma->anon_vma at
all. So, this is an optimization to try avoiding extra operations
whenever we can. I'll try to find your comment on this.
Suren Baghdasaryan April 12, 2024, 1:06 p.m. UTC | #17
On Fri, Apr 12, 2024 at 5:39 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > > > How many instructions it takes for a late RETRY for WRITEs to private file
> > > > mappings, fallback to mmap_sem?
> > >
> > > Doesn't matter.  That happens _once_ per VMA, and it's dwarfed by the
> > > cost of allocating and initialising the COWed page.  You're adding
> > > instructions to every single page fault.  I'm not happy that we had to
> > > add extra instructions to the fault path for single-threaded programs,
> > > but we at least had the justification that we were improving scalability
> > > on large systems.  Your excuse is "it makes the code cleaner".  And
> > > honestly, I don't think it even does that.
> >
> > Suren, what would you think to this?
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6e2fe960473d..e495adcbe968 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >         if (!vma_start_read(vma))
> >                 goto inval;
> >
> > -       /*
> > -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > -        * This check must happen after vma_start_read(); otherwise, a
> > -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > -        * from its anon_vma.
> > -        */
> > -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > -               goto inval_end_read;
> > -
> >         /* Check since vm_start/vm_end might change before we lock the VMA */
> >         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> >                 goto inval_end_read;
> >
> > That takes a few insns out of the page fault path (good!) at the cost
> > of one extra trip around the fault handler for the first fault on an
> > anon vma.  It makes the file & anon paths more similar to each other
> > (good!)
> >
> > We'd need some data to be sure it's really a win, but less code is
> > always good.
>
> You at least need two things:
>
>   (1) don't throw away Jann's comment so easily

I agree, if we make this change we should keep this comment and maybe
move it into vmf_anon_prepare()

>
>   (2) have a look on whether anon memory has the fallback yet, at all

Yeah, I think do_anonymous_page() will have to change as I mentioned
in the previous reply.

>
> Maybe someone can already comment in a harsh way on this one, but no, I'm
> not going to be like that.
>
> I still don't understand why you don't like so much to not fallback at all
> if we could, the flags I checked was all in hot cache I think anyway.
>
> And since I'm also enough on how you comment in your previous replies, I'll
> leave the rest comments for others.

FWIW I fully accept the blame for not seeing that private file mapping
read case regression. In retrospect this should have been obvious...
but hindsight is always 20/20.

>
> --
> Peter Xu
>
Matthew Wilcox April 12, 2024, 1:32 p.m. UTC | #18
On Fri, Apr 12, 2024 at 05:46:52AM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2024 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> About the code, I'll take a closer look once I'm back from vacation
> this weekend but I think you will also have to modify
> do_anonymous_page() to use vmf_anon_prepare() instead of
> anon_vma_prepare().

Ah yes.  Also do_huge_pmd_anonymous_page().  And we should do this:

+++ b/mm/rmap.c
@@ -182,8 +182,6 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
  * for the new allocation. At the same time, we do not want
  * to do any locking for the common case of already having
  * an anon_vma.
- *
- * This must be called with the mmap_lock held for reading.
  */
 int __anon_vma_prepare(struct vm_area_struct *vma)
 {
@@ -191,6 +189,7 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
        struct anon_vma *anon_vma, *allocated;
        struct anon_vma_chain *avc;

+       mmap_assert_locked(mm);
        might_sleep();

        avc = anon_vma_chain_alloc(GFP_KERNEL);

> > We could even eagerly initialise vma->anon_vma for anon vmas.  I don't
> > know why we don't do that.
> 
> You found the answer to that question a long time ago and IIRC it was
> because in many cases we end up not needing to set vma->anon_vma at
> all. So, this is an optimization to try avoiding extra operations
> whenever we can. I'll try to find your comment on this.

I thought it was for file VMAs that I found the answer to that question?
Suren Baghdasaryan April 12, 2024, 1:46 p.m. UTC | #19
On Fri, Apr 12, 2024 at 6:32 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 12, 2024 at 05:46:52AM -0700, Suren Baghdasaryan wrote:
> > On Thu, Apr 11, 2024 at 8:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > About the code, I'll take a closer look once I'm back from vacation
> > this weekend but I think you will also have to modify
> > do_anonymous_page() to use vmf_anon_prepare() instead of
> > anon_vma_prepare().
>
> Ah yes.  Also do_huge_pmd_anonymous_page().  And we should do this:
>
> +++ b/mm/rmap.c
> @@ -182,8 +182,6 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
>   * for the new allocation. At the same time, we do not want
>   * to do any locking for the common case of already having
>   * an anon_vma.
> - *
> - * This must be called with the mmap_lock held for reading.
>   */
>  int __anon_vma_prepare(struct vm_area_struct *vma)
>  {
> @@ -191,6 +189,7 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
>         struct anon_vma *anon_vma, *allocated;
>         struct anon_vma_chain *avc;
>
> +       mmap_assert_locked(mm);
>         might_sleep();
>
>         avc = anon_vma_chain_alloc(GFP_KERNEL);
>

Yes.

> > > We could even eagerly initialise vma->anon_vma for anon vmas.  I don't
> > > know why we don't do that.
> >
> > You found the answer to that question a long time ago and IIRC it was
> > because in many cases we end up not needing to set vma->anon_vma at
> > all. So, this is an optimization to try avoiding extra operations
> > whenever we can. I'll try to find your comment on this.
>
> I thought that was file VMAs that I found the answer to that question?

I'll try to find that discussion once I get back to my workstation this weekend.
Matthew Wilcox April 12, 2024, 2:16 p.m. UTC | #20
On Fri, Apr 12, 2024 at 06:06:46AM -0700, Suren Baghdasaryan wrote:
> > On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > > -       /*
> > > -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > > -        * This check must happen after vma_start_read(); otherwise, a
> > > -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > > -        * from its anon_vma.
> > > -        */
> > > -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > > -               goto inval_end_read;
> > > -
> > >         /* Check since vm_start/vm_end might change before we lock the VMA */
> > >         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > >                 goto inval_end_read;
> > >
> > > That takes a few insns out of the page fault path (good!) at the cost
> > > of one extra trip around the fault handler for the first fault on an
> > > anon vma.  It makes the file & anon paths more similar to each other
> > > (good!)
> > >
> > > We'd need some data to be sure it's really a win, but less code is
> > > always good.
> 
> I agree, if we make this change we should keep this comment and maybe
> move it into vmf_anon_prepare()

Most of the comment makes no sense if you move it out of
lock_vma_under_rcu().  It's justifying where it needs to be in that
function.  If it's no longer in that function, there's not much left of
the comment.  What part do you think is valuable and needs to be retained?
Suren Baghdasaryan April 12, 2024, 2:53 p.m. UTC | #21
On Fri, Apr 12, 2024, 9:16 AM Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Apr 12, 2024 at 06:06:46AM -0700, Suren Baghdasaryan wrote:
> > > On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > > > -       /*
> > > > -        * find_mergeable_anon_vma uses adjacent vmas which are not
> locked.
> > > > -        * This check must happen after vma_start_read(); otherwise,
> a
> > > > -        * concurrent mremap() with MREMAP_DONTUNMAP could
> dissociate the VMA
> > > > -        * from its anon_vma.
> > > > -        */
> > > > -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > > > -               goto inval_end_read;
> > > > -
> > > >         /* Check since vm_start/vm_end might change before we lock
> the VMA */
> > > >         if (unlikely(address < vma->vm_start || address >=
> vma->vm_end))
> > > >                 goto inval_end_read;
> > > >
> > > > That takes a few insns out of the page fault path (good!) at the cost
> > > > of one extra trip around the fault handler for the first fault on an
> > > > anon vma.  It makes the file & anon paths more similar to each other
> > > > (good!)
> > > >
> > > > We'd need some data to be sure it's really a win, but less code is
> > > > always good.
> >
> > I agree, if we make this change we should keep this comment and maybe
> > move it into vmf_anon_prepare()
>
> Most of the comment makes no sense if you move it out of
> lock_vma_under_rcu().  It's justifying where it needs to be in that
> function.  If it's no longer in that function, there's not much left of
> the comment.  What part do you think is valuable and needs to be retained?
>

Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
problem for per-vma locks, we should have an explanation there. This
comment would serve that purpose IMO.




Matthew Wilcox April 12, 2024, 3:19 p.m. UTC | #22
On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> problem for per-vma locks, we should have an explanation there. This
> comment would serve that purpose IMO.

I'll do you one better; here's some nice kernel-doc for
vmf_anon_prepare():

commit f89a1cd17f13
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Fri Apr 12 10:41:02 2024 -0400

    mm: Delay the check for a NULL anon_vma
    
    Instead of checking the anon_vma early in the fault path where all page
    faults pay the cost, delay it until we know we're going to need the
    anon_vma to be filled in.  This will have a slight negative effect on the
    first fault in an anonymous VMA, but it shortens every other page fault.
    It also makes the code slightly cleaner as the anon and file backed
    fault handling look more similar.
    
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8d2ed80b0bf..718f91f74a48 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1057,11 +1057,13 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	gfp_t gfp;
 	struct folio *folio;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	vm_fault_t ret;
 
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return VM_FAULT_FALLBACK;
-	if (unlikely(anon_vma_prepare(vma)))
-		return VM_FAULT_OOM;
+	ret = vmf_anon_prepare(vmf);
+	if (ret)
+		return ret;
 	khugepaged_enter_vma(vma, vma->vm_flags);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
diff --git a/mm/memory.c b/mm/memory.c
index 6e2fe960473d..46b509c3bbc1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3213,6 +3213,21 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
 	return VM_FAULT_RETRY;
 }
 
+/**
+ * vmf_anon_prepare - Prepare to handle an anonymous fault.
+ * @vmf: The vm_fault descriptor passed from the fault handler.
+ *
+ * When preparing to insert an anonymous page into a VMA from a
+ * fault handler, call this function rather than anon_vma_prepare().
+ * If this vma does not already have an associated anon_vma and we are
+ * only protected by the per-VMA lock, the caller must retry with the
+ * mmap_lock held.  __anon_vma_prepare() will look at adjacent VMAs to
+ * determine if this VMA can share its anon_vma, and that's not safe to
+ * do with only the per-VMA lock held for this VMA.
+ *
+ * Return: 0 if fault handling can proceed.  Any other value should be
+ * returned to the caller.
+ */
 vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -4437,8 +4452,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	/* Allocate our own private page. */
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
+	ret = vmf_anon_prepare(vmf);
+	if (ret)
+		return ret;
 	/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
 	folio = alloc_anon_folio(vmf);
 	if (IS_ERR(folio))
@@ -5821,15 +5837,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	if (!vma_start_read(vma))
 		goto inval;
 
-	/*
-	 * find_mergeable_anon_vma uses adjacent vmas which are not locked.
-	 * This check must happen after vma_start_read(); otherwise, a
-	 * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
-	 * from its anon_vma.
-	 */
-	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
-		goto inval_end_read;
-
 	/* Check since vm_start/vm_end might change before we lock the VMA */
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
 		goto inval_end_read;
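
(For context, the function this kernel-doc attaches to already looks
roughly like the following at this point in time; a from-memory sketch for
readers of the archive, not part of the patch itself:)

        vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;

                if (likely(vma->anon_vma))
                        return 0;
                if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                        /* Can't inspect adjacent VMAs; retry under mmap_lock. */
                        vma_end_read(vma);
                        return VM_FAULT_RETRY;
                }
                if (__anon_vma_prepare(vma))
                        return VM_FAULT_OOM;
                return 0;
        }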
Matthew Wilcox April 12, 2024, 3:31 p.m. UTC | #23
On Fri, Apr 12, 2024 at 04:19:27PM +0100, Matthew Wilcox wrote:
> On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > problem for per-vma locks, we should have an explanation there. This
> > comment would serve that purpose IMO.
> 
> I'll do you one better; here's some nice kernel-doc for
> vmd_anon_prepare():

And here's a followup patch to fix some minor issues in uffd.

 - Rename lock_vma() to uffd_lock_vma() because it really is uffd
   specific.
 - Remove comment referencing unlock_vma() which doesn't exist.
 - Fix the comment about lock_vma_under_rcu() which I just made
   incorrect.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6267afe65d1..a32171158c38 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -56,17 +56,16 @@ struct vm_area_struct *find_vma_and_prepare_anon(struct mm_struct *mm,
 
 #ifdef CONFIG_PER_VMA_LOCK
 /*
- * lock_vma() - Lookup and lock vma corresponding to @address.
+ * uffd_lock_vma() - Lookup and lock vma corresponding to @address.
  * @mm: mm to search vma in.
  * @address: address that the vma should contain.
  *
- * Should be called without holding mmap_lock. vma should be unlocked after use
- * with unlock_vma().
+ * Should be called without holding mmap_lock.
  *
  * Return: A locked vma containing @address, -ENOENT if no vma is found, or
  * -ENOMEM if anon_vma couldn't be allocated.
  */
-static struct vm_area_struct *lock_vma(struct mm_struct *mm,
+static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
 				       unsigned long address)
 {
 	struct vm_area_struct *vma;
@@ -74,9 +73,8 @@ static struct vm_area_struct *lock_vma(struct mm_struct *mm,
 	vma = lock_vma_under_rcu(mm, address);
 	if (vma) {
 		/*
-		 * lock_vma_under_rcu() only checks anon_vma for private
-		 * anonymous mappings. But we need to ensure it is assigned in
-		 * private file-backed vmas as well.
+		 * We know we're going to need to use anon_vma, so check
+		 * that early.
 		 */
 		if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
 			vma_end_read(vma);
@@ -107,7 +105,7 @@ static struct vm_area_struct *uffd_mfill_lock(struct mm_struct *dst_mm,
 {
 	struct vm_area_struct *dst_vma;
 
-	dst_vma = lock_vma(dst_mm, dst_start);
+	dst_vma = uffd_lock_vma(dst_mm, dst_start);
 	if (IS_ERR(dst_vma) || validate_dst_vma(dst_vma, dst_start + len))
 		return dst_vma;
 
@@ -1437,7 +1435,7 @@ static int uffd_move_lock(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	int err;
 
-	vma = lock_vma(mm, dst_start);
+	vma = uffd_lock_vma(mm, dst_start);
 	if (IS_ERR(vma))
 		return PTR_ERR(vma);
 
@@ -1452,7 +1450,7 @@ static int uffd_move_lock(struct mm_struct *mm,
 	}
 
 	/*
-	 * Using lock_vma() to get src_vma can lead to following deadlock:
+	 * Using uffd_lock_vma() to get src_vma can lead to following deadlock:
 	 *
 	 * Thread1				Thread2
 	 * -------				-------
@@ -1474,7 +1472,7 @@ static int uffd_move_lock(struct mm_struct *mm,
 	err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
 	if (!err) {
 		/*
-		 * See comment in lock_vma() as to why not using
+		 * See comment in uffd_lock_vma() as to why not using
 		 * vma_start_read() here.
 		 */
 		down_read(&(*dst_vmap)->vm_lock->lock);
Suren Baghdasaryan April 13, 2024, 9:41 p.m. UTC | #24
On Fri, Apr 12, 2024 at 8:19 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > problem for per-vma locks, we should have an explanation there. This
> > comment would serve that purpose IMO.
>
> I'll do you one better; here's some nice kernel-doc for
> vmd_anon_prepare():
>
> commit f89a1cd17f13
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Fri Apr 12 10:41:02 2024 -0400
>
>     mm: Delay the check for a NULL anon_vma
>
>     Instead of checking the anon_vma early in the fault path where all page
>     faults pay the cost, delay it until we know we're going to need the
>     anon_vma to be filled in.  This will have a slight negative effect on the
>     first fault in an anonymous VMA, but it shortens every other page fault.
>     It also makes the code slightly cleaner as the anon and file backed
>     fault handling look more similar.
>
>     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

The patch looks good to me but I would like to run benchmark tests to
see how it affects different workloads (both positive and negative
effects). I'll try to do that in this coming week and will report if I
find any considerable difference. For now:

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Also we should ensure that this patch goes together with the next one
adjusting related code in uffd. It would be better to resend them as
one patchset to avoid confusion.

>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..718f91f74a48 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1057,11 +1057,13 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>         gfp_t gfp;
>         struct folio *folio;
>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +       vm_fault_t ret;
>
>         if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>                 return VM_FAULT_FALLBACK;
> -       if (unlikely(anon_vma_prepare(vma)))
> -               return VM_FAULT_OOM;
> +       ret = vmf_anon_prepare(vmf);
> +       if (ret)
> +               return ret;
>         khugepaged_enter_vma(vma, vma->vm_flags);
>
>         if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..46b509c3bbc1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3213,6 +3213,21 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
>         return VM_FAULT_RETRY;
>  }
>
> +/**
> + * vmf_anon_prepare - Prepare to handle an anonymous fault.
> + * @vmf: The vm_fault descriptor passed from the fault handler.
> + *
> + * When preparing to insert an anonymous page into a VMA from a
> + * fault handler, call this function rather than anon_vma_prepare().
> + * If this vma does not already have an associated anon_vma and we are
> + * only protected by the per-VMA lock, the caller must retry with the
> + * mmap_lock held.  __anon_vma_prepare() will look at adjacent VMAs to
> + * determine if this VMA can share its anon_vma, and that's not safe to
> + * do with only the per-VMA lock held for this VMA.
> + *
> + * Return: 0 if fault handling can proceed.  Any other value should be
> + * returned to the caller.
> + */
>  vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
> @@ -4437,8 +4452,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         }
>
>         /* Allocate our own private page. */
> -       if (unlikely(anon_vma_prepare(vma)))
> -               goto oom;
> +       ret = vmf_anon_prepare(vmf);
> +       if (ret)
> +               return ret;
>         /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
>         folio = alloc_anon_folio(vmf);
>         if (IS_ERR(folio))
> @@ -5821,15 +5837,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma_start_read(vma))
>                 goto inval;
>
> -       /*
> -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> -        * This check must happen after vma_start_read(); otherwise, a
> -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> -        * from its anon_vma.
> -        */
> -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> -               goto inval_end_read;
> -
>         /* Check since vm_start/vm_end might change before we lock the VMA */
>         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>                 goto inval_end_read;
Suren Baghdasaryan April 13, 2024, 9:46 p.m. UTC | #25
On Fri, Apr 12, 2024 at 8:31 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 12, 2024 at 04:19:27PM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > > problem for per-vma locks, we should have an explanation there. This
> > > comment would serve that purpose IMO.
> >
> > I'll do you one better; here's some nice kernel-doc for
> > vmf_anon_prepare():
>
> And here's a followup patch to fix some minor issues in uffd.
>
>  - Rename lock_vma() to uffd_lock_vma() because it really is uffd
>    specific.

I'm planning to expand the scope of lock_vma() and reuse it for
/proc/pid/maps reading under per-VMA locks. No objection to renaming
it for now but I'll likely rename it back later once it's used in more
places.

>  - Remove comment referencing unlock_vma() which doesn't exist.
>  - Fix the comment about lock_vma_under_rcu() which I just made
>    incorrect.
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index f6267afe65d1..a32171158c38 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -56,17 +56,16 @@ struct vm_area_struct *find_vma_and_prepare_anon(struct mm_struct *mm,
>
>  #ifdef CONFIG_PER_VMA_LOCK
>  /*
> - * lock_vma() - Lookup and lock vma corresponding to @address.
> + * uffd_lock_vma() - Lookup and lock vma corresponding to @address.
>   * @mm: mm to search vma in.
>   * @address: address that the vma should contain.
>   *
> - * Should be called without holding mmap_lock. vma should be unlocked after use
> - * with unlock_vma().
> + * Should be called without holding mmap_lock.
>   *
>   * Return: A locked vma containing @address, -ENOENT if no vma is found, or
>   * -ENOMEM if anon_vma couldn't be allocated.
>   */
> -static struct vm_area_struct *lock_vma(struct mm_struct *mm,
> +static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
>                                        unsigned long address)
>  {
>         struct vm_area_struct *vma;
> @@ -74,9 +73,8 @@ static struct vm_area_struct *lock_vma(struct mm_struct *mm,
>         vma = lock_vma_under_rcu(mm, address);
>         if (vma) {
>                 /*
> -                * lock_vma_under_rcu() only checks anon_vma for private
> -                * anonymous mappings. But we need to ensure it is assigned in
> -                * private file-backed vmas as well.
> +                * We know we're going to need to use anon_vma, so check
> +                * that early.
>                  */
>                 if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
>                         vma_end_read(vma);
> @@ -107,7 +105,7 @@ static struct vm_area_struct *uffd_mfill_lock(struct mm_struct *dst_mm,
>  {
>         struct vm_area_struct *dst_vma;
>
> -       dst_vma = lock_vma(dst_mm, dst_start);
> +       dst_vma = uffd_lock_vma(dst_mm, dst_start);
>         if (IS_ERR(dst_vma) || validate_dst_vma(dst_vma, dst_start + len))
>                 return dst_vma;
>
> @@ -1437,7 +1435,7 @@ static int uffd_move_lock(struct mm_struct *mm,
>         struct vm_area_struct *vma;
>         int err;
>
> -       vma = lock_vma(mm, dst_start);
> +       vma = uffd_lock_vma(mm, dst_start);
>         if (IS_ERR(vma))
>                 return PTR_ERR(vma);
>
> @@ -1452,7 +1450,7 @@ static int uffd_move_lock(struct mm_struct *mm,
>         }
>
>         /*
> -        * Using lock_vma() to get src_vma can lead to following deadlock:
> +        * Using uffd_lock_vma() to get src_vma can lead to following deadlock:
>          *
>          * Thread1                              Thread2
>          * -------                              -------
> @@ -1474,7 +1472,7 @@ static int uffd_move_lock(struct mm_struct *mm,
>         err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
>         if (!err) {
>                 /*
> -                * See comment in lock_vma() as to why not using
> +                * See comment in uffd_lock_vma() as to why not using
>                  * vma_start_read() here.
>                  */
>                 down_read(&(*dst_vmap)->vm_lock->lock);
Matthew Wilcox April 13, 2024, 10:46 p.m. UTC | #26
On Sat, Apr 13, 2024 at 02:41:55PM -0700, Suren Baghdasaryan wrote:
> The patch looks good to me but I would like to run benchmark tests to
> see how it affects different workloads (both positive and negative
> effects). I'll try to do that in this coming week and will report if I
> find any considerable difference. For now:
> 
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> 
> Also we should ensure that this patch goes together with the next one
> adjusting related code in uffd. It would be better to resend them as
> one patchset to avoid confusion.

I've put them here:

http://git.infradead.org/?p=users/willy/pagecache.git;a=shortlog;h=refs/heads/vma-lock

0day has already run build tests, and I'm assured that they'll run perf
tests in the next few days.
Matthew Wilcox April 13, 2024, 10:52 p.m. UTC | #27
On Sat, Apr 13, 2024 at 02:46:56PM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 12, 2024 at 8:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> >  - Rename lock_vma() to uffd_lock_vma() because it really is uffd
> >    specific.
> 
> I'm planning to expand the scope of lock_vma() and reuse it for
> /proc/pid/maps reading under per-VMA locks. No objection to renaming
> it for now but I'll likely rename it back later once it's used in more
> places.

That would seem like a mistake.  The uffd lock_vma() will create an
anon_vma for VMAs that don't have one, and you wouldn't want that.
It seems to me that lock_vma_under_rcu() does everything you want except
the fallback to mmap_read_lock().  And I'm not sure there's a good way
to package that up ... indeed, I don't see why you'd want the "take
the mmap_lock, look up the VMA, drop the mmap read lock" part at all --
once you've got the mmap_lock, just hold it until you're done.
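
A hypothetical sketch of that pattern (not from any patch in this thread;
maps_lock_vma() and the mmap_locked flag are made-up names, purely for
illustration): try the per-VMA lock first, and on failure take the mmap_lock
for reading and let the caller keep holding it until it is done.

	static struct vm_area_struct *maps_lock_vma(struct mm_struct *mm,
						    unsigned long addr,
						    bool *mmap_locked)
	{
		struct vm_area_struct *vma = lock_vma_under_rcu(mm, addr);

		if (vma) {
			*mmap_locked = false;	/* drop with vma_end_read() */
			return vma;
		}
		mmap_read_lock(mm);
		*mmap_locked = true;		/* caller keeps it until done */
		return find_vma(mm, addr);
	}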
Suren Baghdasaryan April 13, 2024, 11:11 p.m. UTC | #28
On Sat, Apr 13, 2024 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Apr 13, 2024 at 02:46:56PM -0700, Suren Baghdasaryan wrote:
> > On Fri, Apr 12, 2024 at 8:31 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >  - Rename lock_vma() to uffd_lock_vma() because it really is uffd
> > >    specific.
> >
> > I'm planning to expand the scope of lock_vma() and reuse it for
> > /proc/pid/maps reading under per-VMA locks. No objection to renaming
> > it for now but I'll likely rename it back later once it's used in more
> > places.
>
> That would seem like a mistake.  The uffd lock_vma() will create an
> anon_vma for VMAs that don't have one, and you wouldn't want that.
> It seems to me that lock_vma_under_rcu() does everything you want except
> the fallback to mmap_read_lock().  And I'm not sure there's a good way
> to package that up ... indeed, I don't see why you'd want the "take
> the mmap_lock, look up the VMA, drop the mmap read lock" part at all --
> once you've got the mmap_lock, just hold it until you're done.

Yeah, you are right about anon_vma creation. I definitely don't want
that part when reading maps files.

I'm not sure about holding the mmap_lock until I'm done, though. The
goal of that patch is to minimize how long we block modifications while
we are reading maps files, so locking smaller ranges at a time might
still make sense. But it would be hard to argue one way or another
without any data.
Suren Baghdasaryan April 15, 2024, 3:58 p.m. UTC | #29
On Fri, Apr 12, 2024 at 8:19 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > problem for per-vma locks, we should have an explanation there. This
> > comment would serve that purpose IMO.
>
> I'll do you one better; here's some nice kernel-doc for
> vmf_anon_prepare():

I was looking at find_tcp_vma(), which seems to be the only other
place where lock_vma_under_rcu() is currently used. I think it's used
there only for file-backed pages, so I don't think your change affects
that use case, but it makes me think that we should have some kind of
warning for future users of lock_vma_under_rcu()... Maybe your addition
of mmap_assert_locked() inside __anon_vma_prepare() is enough. Please
don't forget to include that assertion in your final patch.
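
For reference, a minimal sketch of what such an assertion could look like,
assuming it sits at the top of __anon_vma_prepare() in mm/rmap.c (the exact
form in the branch may differ):

	int __anon_vma_prepare(struct vm_area_struct *vma)
	{
		struct mm_struct *mm = vma->vm_mm;

		/* Catch any caller that got here with only a per-VMA lock. */
		mmap_assert_locked(mm);

		/* ... existing allocation / find_mergeable_anon_vma() logic ... */
	}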


>
> commit f89a1cd17f13
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Fri Apr 12 10:41:02 2024 -0400
>
>     mm: Delay the check for a NULL anon_vma
>
>     Instead of checking the anon_vma early in the fault path where all page
>     faults pay the cost, delay it until we know we're going to need the
>     anon_vma to be filled in.  This will have a slight negative effect on the
>     first fault in an anonymous VMA, but it shortens every other page fault.
>     It also makes the code slightly cleaner as the anon and file backed
>     fault handling look more similar.
>
>     Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..718f91f74a48 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1057,11 +1057,13 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>         gfp_t gfp;
>         struct folio *folio;
>         unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> +       vm_fault_t ret;
>
>         if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>                 return VM_FAULT_FALLBACK;
> -       if (unlikely(anon_vma_prepare(vma)))
> -               return VM_FAULT_OOM;
> +       ret = vmf_anon_prepare(vmf);
> +       if (ret)
> +               return ret;
>         khugepaged_enter_vma(vma, vma->vm_flags);
>
>         if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..46b509c3bbc1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3213,6 +3213,21 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
>         return VM_FAULT_RETRY;
>  }
>
> +/**
> + * vmf_anon_prepare - Prepare to handle an anonymous fault.
> + * @vmf: The vm_fault descriptor passed from the fault handler.
> + *
> + * When preparing to insert an anonymous page into a VMA from a
> + * fault handler, call this function rather than anon_vma_prepare().
> + * If this vma does not already have an associated anon_vma and we are
> + * only protected by the per-VMA lock, the caller must retry with the
> + * mmap_lock held.  __anon_vma_prepare() will look at adjacent VMAs to
> + * determine if this VMA can share its anon_vma, and that's not safe to
> + * do with only the per-VMA lock held for this VMA.
> + *
> + * Return: 0 if fault handling can proceed.  Any other value should be
> + * returned to the caller.
> + */
>  vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
> @@ -4437,8 +4452,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         }
>
>         /* Allocate our own private page. */
> -       if (unlikely(anon_vma_prepare(vma)))
> -               goto oom;
> +       ret = vmf_anon_prepare(vmf);
> +       if (ret)
> +               return ret;
>         /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
>         folio = alloc_anon_folio(vmf);
>         if (IS_ERR(folio))
> @@ -5821,15 +5837,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma_start_read(vma))
>                 goto inval;
>
> -       /*
> -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> -        * This check must happen after vma_start_read(); otherwise, a
> -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> -        * from its anon_vma.
> -        */
> -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> -               goto inval_end_read;
> -
>         /* Check since vm_start/vm_end might change before we lock the VMA */
>         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>                 goto inval_end_read;
Matthew Wilcox April 15, 2024, 4:13 p.m. UTC | #30
On Mon, Apr 15, 2024 at 08:58:28AM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 12, 2024 at 8:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > > problem for per-vma locks, we should have an explanation there. This
> > > comment would serve that purpose IMO.
> >
> > I'll do you one better; here's some nice kernel-doc for
> > vmf_anon_prepare():
> 
> I was looking at the find_tcp_vma(), which seems to be the only other
> place where lock_vma_under_rcu() is currently used. I think it's used
> there only for file-backed pages, so I don't think your change affects
> that usecase but this makes me think that we should have some kind of
> a warning for lock_vma_under_rcu() future users... Maybe your addition
> of mmap_assert_locked() inside __anon_vma_prepare() is enough. Please
> don't forget to include that assertion into your final patch.

That's patch 1/3 on the git branch I pointed you to.

The tcp vma is not file backed, but I'm pretty sure that COW is not
something they want, so there's never an anon_vma.  It's for pages
that contain received TCP packets; ie it's mmaped TCP.
Suren Baghdasaryan April 15, 2024, 4:19 p.m. UTC | #31
On Mon, Apr 15, 2024 at 9:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 15, 2024 at 08:58:28AM -0700, Suren Baghdasaryan wrote:
> > On Fri, Apr 12, 2024 at 8:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > > > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > > > problem for per-vma locks, we should have an explanation there. This
> > > > comment would serve that purpose IMO.
> > >
> > > I'll do you one better; here's some nice kernel-doc for
> > > vmf_anon_prepare():
> >
> > I was looking at the find_tcp_vma(), which seems to be the only other
> > place where lock_vma_under_rcu() is currently used. I think it's used
> > there only for file-backed pages, so I don't think your change affects
> > that usecase but this makes me think that we should have some kind of
> > a warning for lock_vma_under_rcu() future users... Maybe your addition
> > of mmap_assert_locked() inside __anon_vma_prepare() is enough. Please
> > don't forget to include that assertion into your final patch.
>
> That's patch 1/3 on the git branch I pointed you to.

Ah, good!

>
> The tcp vma is not file backed, but I'm pretty sure that COW is not
> something they want, so there's never an anon_vma.  It's for pages
> that contain received TCP packets; ie it's mmaped TCP.

I was following
tcp_zerocopy_receive()->tcp_zerocopy_vm_insert_batch()->vm_insert_pages()->insert_page_in_batch_locked()->validate_page_before_insert()
which errors out for PageAnon(page). So, I assumed this path works on
file-backed pages but I'm not familiar with this code at all.
Matthew Wilcox April 15, 2024, 4:26 p.m. UTC | #32
On Mon, Apr 15, 2024 at 09:19:06AM -0700, Suren Baghdasaryan wrote:
> On Mon, Apr 15, 2024 at 9:13 AM Matthew Wilcox <willy@infradead.org> wrote:
> > The tcp vma is not file backed, but I'm pretty sure that COW is not
> > something they want, so there's never an anon_vma.  It's for pages
> > that contain received TCP packets; ie it's mmaped TCP.
> 
> I was following
> tcp_zerocopy_receive()->tcp_zerocopy_vm_insert_batch()->vm_insert_pages()->insert_page_in_batch_locked()->validate_page_before_insert()
> which errors out for PageAnon(page). So, I assumed this path works on
> file-backed pages but I'm not familiar with this code at all.

It turns out there are pages which are neither file nor anon ;-)
I have a partial list here:

https://kernelnewbies.org/MatthewWilcox/Memdescs

but I don't even have TCP on that list.  I haven't looked into what they
need -- I don't think they need mapping, index, etc, but I need to
figure that out.
Matthew Wilcox April 26, 2024, 2 p.m. UTC | #33
On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> Suren, what would you think to this?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..e495adcbe968 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>         if (!vma_start_read(vma))
>                 goto inval;
> 
> -       /*
> -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> -        * This check must happen after vma_start_read(); otherwise, a
> -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> -        * from its anon_vma.
> -        */
> -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> -               goto inval_end_read;
> -
>         /* Check since vm_start/vm_end might change before we lock the VMA */
>         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
>                 goto inval_end_read;
> 
> That takes a few insns out of the page fault path (good!) at the cost
> of one extra trip around the fault handler for the first fault on an
> anon vma.  It makes the file & anon paths more similar to each other
> (good!)
> 
> We'd need some data to be sure it's really a win, but less code is
> always good.

Intel's 0day got back to me with data and it's ridiculously good.
Headline figure: over 3x throughput improvement with vm-scalability
https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/

I can't see why it's that good.  It shouldn't be that good.  I'm
seeing big numbers here:

      4366 ±  2%    +565.6%      29061        perf-stat.overall.cycles-between-cache-misses

and the code being deleted is only checking vma->vm_ops and
vma->anon_vma.  Surely that cache line is referenced so frequently
during pagefault that deleting a reference here will make no difference
at all?

We've clearly got an inlining change.  viz:

     72.57           -72.6        0.00        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
     73.28           -72.6        0.70        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
     72.55           -72.5        0.00        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     69.93           -69.9        0.00        perf-profile.calltrace.cycles-pp.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     69.12           -69.1        0.00        perf-profile.calltrace.cycles-pp.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
     68.78           -68.8        0.00        perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault
     65.78           -65.8        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault
     65.43           -65.4        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma

     11.22           +86.5       97.68        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11.14           +86.5       97.66        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
      3.17 ±  2%     +94.0       97.12        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
      3.45 ±  2%     +94.1       97.59        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
      0.00           +98.2       98.15        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00           +98.2       98.16        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe

so maybe the compiler has been able to eliminate some loads from
contended cachelines?

    703147           -87.6%      87147 ±  2%  perf-stat.ps.context-switches
    663.67 ±  5%   +7551.9%      50783        vm-scalability.time.involuntary_context_switches
 1.105e+08           -86.7%   14697764 ±  2%  vm-scalability.time.voluntary_context_switches

indicates to me that we're taking the mmap rwsem far less often (those
would be accounted as voluntary context switches).

So maybe the cache miss reduction is a consequence of just running for
longer before being preempted.

I still don't understand why we have to take the mmap_sem less often.
Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
an anon_vma on a page fault?
Suren Baghdasaryan April 26, 2024, 3:07 p.m. UTC | #34
On Fri, Apr 26, 2024 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > Suren, what would you think to this?
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6e2fe960473d..e495adcbe968 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >         if (!vma_start_read(vma))
> >                 goto inval;
> >
> > -       /*
> > -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > -        * This check must happen after vma_start_read(); otherwise, a
> > -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > -        * from its anon_vma.
> > -        */
> > -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > -               goto inval_end_read;
> > -
> >         /* Check since vm_start/vm_end might change before we lock the VMA */
> >         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> >                 goto inval_end_read;
> >
> > That takes a few insns out of the page fault path (good!) at the cost
> > of one extra trip around the fault handler for the first fault on an
> > anon vma.  It makes the file & anon paths more similar to each other
> > (good!)
> >
> > We'd need some data to be sure it's really a win, but less code is
> > always good.
>
> Intel's 0day got back to me with data and it's ridiculously good.
> Headline figure: over 3x throughput improvement with vm-scalability
> https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/
>
> I can't see why it's that good.  It shouldn't be that good.  I'm
> seeing big numbers here:
>
>       4366 ±  2%    +565.6%      29061        perf-stat.overall.cycles-between-cache-misses
>
> and the code being deleted is only checking vma->vm_ops and
> vma->anon_vma.  Surely that cache line is referenced so frequently
> during pagefault that deleting a reference here will make no difference
> at all?

That indeed looks overly good. Sorry, I didn't have a chance to run
the benchmarks on my side yet because of the ongoing Android bootcamp
this week.

>
> We've clearly got an inlining change.  viz:
>
>      72.57           -72.6        0.00        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
>      73.28           -72.6        0.70        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
>      72.55           -72.5        0.00        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>      69.93           -69.9        0.00        perf-profile.calltrace.cycles-pp.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>      69.12           -69.1        0.00        perf-profile.calltrace.cycles-pp.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>      68.78           -68.8        0.00        perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault
>      65.78           -65.8        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault
>      65.43           -65.4        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma
>
>      11.22           +86.5       97.68        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      11.14           +86.5       97.66        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
>       3.17 ±  2%     +94.0       97.12        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
>       3.45 ±  2%     +94.1       97.59        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
>       0.00           +98.2       98.15        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00           +98.2       98.16        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
>
> so maybe the compiler has been able to eliminate some loads from
> contended cachelines?
>
>     703147           -87.6%      87147 ±  2%  perf-stat.ps.context-switches
>     663.67 ±  5%   +7551.9%      50783        vm-scalability.time.involuntary_context_switches
>  1.105e+08           -86.7%   14697764 ±  2%  vm-scalability.time.voluntary_context_switches
>
> indicates to me that we're taking the mmap rwsem far less often (those
> would be accounted as voluntary context switches).
>
> So maybe the cache miss reduction is a consequence of just running for
> longer before being preempted.
>
> I still don't understand why we have to take the mmap_sem less often.
> Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
> an anon_vma on a page fault?

I think the only path in either do_anonymous_page() or
do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
the "Use the zero-page for reads" here:
https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
didn't look into this particular benchmark yet but will try it out
once I have some time to benchmark your change.

>
Matthew Wilcox April 26, 2024, 3:28 p.m. UTC | #35
On Fri, Apr 26, 2024 at 08:07:45AM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 26, 2024 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
> > Intel's 0day got back to me with data and it's ridiculously good.
> > Headline figure: over 3x throughput improvement with vm-scalability
> > https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/
> >
> > I can't see why it's that good.  It shouldn't be that good.  I'm
> > seeing big numbers here:
> >
> >       4366 ±  2%    +565.6%      29061        perf-stat.overall.cycles-between-cache-misses
> >
> > and the code being deleted is only checking vma->vm_ops and
> > vma->anon_vma.  Surely that cache line is referenced so frequently
> > during pagefault that deleting a reference here will make no difference
> > at all?
> 
> That indeed looks overly good. Sorry, I didn't have a chance to run
> the benchmarks on my side yet because of the ongoing Android bootcamp
> this week.

No problem.  Darn work getting in the way of having fun ;-)

> > I still don't understand why we have to take the mmap_sem less often.
> > Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
> > an anon_vma on a page fault?
> 
> I think the only path in either do_anonymous_page() or
> do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
> the "Use the zero-page for reads" here:
> https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
> didn't look into this particular benchmark yet but will try it out
> once I have some time to benchmark your change.

Yes, Liam and I had just brainstormed that as being a plausible
explanation too.  I don't know how frequent it is to use anon memory
read-only.  Presumably it must happen often enough that we've bothered
to implement the zero-page optimisation.  But probably not nearly as
often as this benchmark makes it happen ;-)
Suren Baghdasaryan April 26, 2024, 3:32 p.m. UTC | #36
On Fri, Apr 26, 2024 at 8:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 26, 2024 at 08:07:45AM -0700, Suren Baghdasaryan wrote:
> > On Fri, Apr 26, 2024 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > Intel's 0day got back to me with data and it's ridiculously good.
> > > Headline figure: over 3x throughput improvement with vm-scalability
> > > https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/
> > >
> > > I can't see why it's that good.  It shouldn't be that good.  I'm
> > > seeing big numbers here:
> > >
> > >       4366 ±  2%    +565.6%      29061        perf-stat.overall.cycles-between-cache-misses
> > >
> > > and the code being deleted is only checking vma->vm_ops and
> > > vma->anon_vma.  Surely that cache line is referenced so frequently
> > > during pagefault that deleting a reference here will make no difference
> > > at all?
> >
> > That indeed looks overly good. Sorry, I didn't have a chance to run
> > the benchmarks on my side yet because of the ongoing Android bootcamp
> > this week.
>
> No problem.  Darn work getting in the way of having fun ;-)
>
> > > I still don't understand why we have to take the mmap_sem less often.
> > > Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
> > > an anon_vma on a page fault?
> >
> > I think the only path in either do_anonymous_page() or
> > do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
> > the "Use the zero-page for reads" here:
> > https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
> > didn't look into this particular benchmark yet but will try it out
> > once I have some time to benchmark your change.
>
> Yes, Liam and I had just brainstormed that as being a plausible
> explanation too.  I don't know how frequent it is to use anon memory
> read-only.  Presumably it must happen often enough that we've bothered
> to implement the zero-page optimisation.  But probably not nearly as
> often as this benchmark makes it happen ;-)

I also wonder if some of this improvement can be attributed to the
last patch in your series
(https://lore.kernel.org/all/20240426144506.1290619-5-willy@infradead.org/).
I assume it was included in the 0day testing?
Liam R. Howlett April 26, 2024, 3:32 p.m. UTC | #37
* Suren Baghdasaryan <surenb@google.com> [240426 11:08]:
> On Fri, Apr 26, 2024 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > > Suren, what would you think to this?
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6e2fe960473d..e495adcbe968 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >         if (!vma_start_read(vma))
> > >                 goto inval;
> > >
> > > -       /*
> > > -        * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > > -        * This check must happen after vma_start_read(); otherwise, a
> > > -        * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > > -        * from its anon_vma.
> > > -        */
> > > -       if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > > -               goto inval_end_read;
> > > -
> > >         /* Check since vm_start/vm_end might change before we lock the VMA */
> > >         if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > >                 goto inval_end_read;
> > >
> > > That takes a few insns out of the page fault path (good!) at the cost
> > > of one extra trip around the fault handler for the first fault on an
> > > anon vma.  It makes the file & anon paths more similar to each other
> > > (good!)
> > >
> > > We'd need some data to be sure it's really a win, but less code is
> > > always good.
> >
> > Intel's 0day got back to me with data and it's ridiculously good.
> > Headline figure: over 3x throughput improvement with vm-scalability
> > https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/
> >
> > I can't see why it's that good.  It shouldn't be that good.  I'm
> > seeing big numbers here:
> >
> >       4366 ±  2%    +565.6%      29061        perf-stat.overall.cycles-between-cache-misses
> >
> > and the code being deleted is only checking vma->vm_ops and
> > vma->anon_vma.  Surely that cache line is referenced so frequently
> > during pagefault that deleting a reference here will make no difference
> > at all?
> 
> That indeed looks overly good. Sorry, I didn't have a chance to run
> the benchmarks on my side yet because of the ongoing Android bootcamp
> this week.
> 
> >
> > We've clearly got an inlining change.  viz:
> >
> >      72.57           -72.6        0.00        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
> >      73.28           -72.6        0.70        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
> >      72.55           -72.5        0.00        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
> >      69.93           -69.9        0.00        perf-profile.calltrace.cycles-pp.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
> >      69.12           -69.1        0.00        perf-profile.calltrace.cycles-pp.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
> >      68.78           -68.8        0.00        perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault
> >      65.78           -65.8        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault
> >      65.43           -65.4        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma
> >
> >      11.22           +86.5       97.68        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >      11.14           +86.5       97.66        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
> >       3.17 ±  2%     +94.0       97.12        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
> >       3.45 ±  2%     +94.1       97.59        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
> >       0.00           +98.2       98.15        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >       0.00           +98.2       98.16        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >
> > so maybe the compiler has been able to eliminate some loads from
> > contended cachelines?
> >
> >     703147           -87.6%      87147 ±  2%  perf-stat.ps.context-switches
> >     663.67 ±  5%   +7551.9%      50783        vm-scalability.time.involuntary_context_switches
> >  1.105e+08           -86.7%   14697764 ±  2%  vm-scalability.time.voluntary_context_switches
> >
> > indicates to me that we're taking the mmap rwsem far less often (those
> > would be accounted as voluntary context switches).
> >
> > So maybe the cache miss reduction is a consequence of just running for
> > longer before being preempted.
> >
> > I still don't understand why we have to take the mmap_sem less often.
> > Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
> > an anon_vma on a page fault?
> 
> I think the only path in either do_anonymous_page() or
> do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
> the "Use the zero-page for reads" here:
> https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
> didn't look into this particular benchmark yet but will try it out
> once I have some time to benchmark your change.
> 

This test is read-only:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-small-allocs-mt

And the zero-page looks to be what's going on here to me as well.
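
Roughly the kind of access pattern involved (an illustrative userspace sketch,
not the benchmark's actual source; len and sum are placeholders): anonymous
memory mapped writable but only ever read, so every fault is satisfied from
the zero page and vma->anon_vma is never allocated.

	volatile long sum = 0;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	for (size_t i = 0; i < len; i += 4096)
		sum += p[i];	/* read-only faults: zero page, no anon_vma */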

Would such a change have an impact on people who "fault in" the memory
instead of asking for it to be populated through an API?

Thanks,
Liam
Matthew Wilcox April 26, 2024, 3:50 p.m. UTC | #38
On Fri, Apr 26, 2024 at 08:32:06AM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 26, 2024 at 8:28 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > I think the only path in either do_anonymous_page() or
> > > do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
> > > the "Use the zero-page for reads" here:
> > > https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
> > > didn't look into this particular benchmark yet but will try it out
> > > once I have some time to benchmark your change.
> >
> > Yes, Liam and I had just brainstormed that as being a plausible
> > explanation too.  I don't know how frequent it is to use anon memory
> > read-only.  Presumably it must happen often enough that we've bothered
> > to implement the zero-page optimisation.  But probably not nearly as
> > often as this benchmark makes it happen ;-)
> 
> I also wonder if some of this improvement can be attributed to the
> last patch in your series
> (https://lore.kernel.org/all/20240426144506.1290619-5-willy@infradead.org/).
> I assume it was included in the 0day testing?

Patch 4 was where I expected to see the improvement too.  But I think
what's going on is that this benchmark evaded all our hard work on
page fault scalability.  Because it's read-only, it never assigned an
anon_vma and so all its page faults fell back to taking the mmap_sem.
So patch 4 will have no effect on this benchmark.

The report from 0day is pretty clear that they bisected the performance
improvement to patch 2.
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 78422d1c7381..4e2a9c4d9776 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3219,10 +3219,8 @@  vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
 
 	if (likely(vma->anon_vma))
 		return 0;
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		vma_end_read(vma);
-		return VM_FAULT_RETRY;
-	}
+	/* We shouldn't try a per-vma fault at all if anon_vma isn't solid */
+	WARN_ON_ONCE(vmf->flags & FAULT_FLAG_VMA_LOCK);
 	if (__anon_vma_prepare(vma))
 		return VM_FAULT_OOM;
 	return 0;
@@ -5826,9 +5824,9 @@  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	 * find_mergeable_anon_vma uses adjacent vmas which are not locked.
 	 * This check must happen after vma_start_read(); otherwise, a
 	 * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
-	 * from its anon_vma.
+	 * from its anon_vma.  This applies to both anon or private file maps.
 	 */
-	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
+	if (unlikely(!(vma->vm_flags & VM_SHARED) && !vma->anon_vma))
 		goto inval_end_read;
 
 	/* Check since vm_start/vm_end might change before we lock the VMA */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6267afe65d1..61f21da77dcd 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -72,17 +72,8 @@  static struct vm_area_struct *lock_vma(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 
 	vma = lock_vma_under_rcu(mm, address);
-	if (vma) {
-		/*
-		 * lock_vma_under_rcu() only checks anon_vma for private
-		 * anonymous mappings. But we need to ensure it is assigned in
-		 * private file-backed vmas as well.
-		 */
-		if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
-			vma_end_read(vma);
-		else
-			return vma;
-	}
+	if (vma)
+		return vma;
 
 	mmap_read_lock(mm);
 	vma = find_vma_and_prepare_anon(mm, address);