[v2,3/3] fork: lock VMAs of the parent process when forking

Message ID 20230708191212.4147700-3-surenb@google.com (mailing list archive)
State New
Series [v2,1/3] mm: lock a vma before stack expansion

Commit Message

Suren Baghdasaryan July 8, 2023, 7:12 p.m. UTC
When forking a child process, the parent write-protects an anonymous page
and COW-shares it with the child being forked using copy_present_pte().
The parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap(). If we get a write-fault before that TLB flush in the parent,
and we end up replacing that anonymous page in the parent process in
do_wp_page() (because it is COW-shared with the child), this might lead to
some stale writable TLB entries targeting the wrong (old) page.
A similar issue happened in the past with userfaultfd (see the
flush_tlb_page() call inside do_wp_page()).
Lock the VMAs of the parent process when forking a child, which prevents
concurrent page faults during the fork operation and avoids this issue.
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show a noticeable regression on a 56-core machine, while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
showed a ~5% regression. If such a fork-time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore the previous performance.
Further optimizations are possible if this regression proves to be
problematic.

Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)
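
As a rough illustration, the stress test described in the commit message
can be sketched like this (the VMA and fork counts are taken from the
text above; everything else, including the alternating protections used
to keep adjacent mappings from merging into one VMA, is an assumption
rather than the actual test):

#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_VMAS  10000
#define NR_FORKS 5000

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* Alternate protections so adjacent mappings cannot merge. */
	for (int i = 0; i < NR_VMAS; i++) {
		int prot = (i & 1) ? PROT_READ : PROT_READ | PROT_WRITE;

		if (mmap(NULL, page, prot,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED)
			return 1;
	}

	/* Fork in a tight loop; each child exits immediately. */
	for (int i = 0; i < NR_FORKS; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return 1;
		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
	return 0;
}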

Comments

Suren Baghdasaryan July 8, 2023, 7:22 p.m. UTC | #1
On Sat, Jul 8, 2023 at 12:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> When forking a child process, the parent write-protects an anonymous page
> and COW-shares it with the child being forked using copy_present_pte().
> The parent's TLB is flushed right before we drop the parent's mmap_lock in
> dup_mmap(). If we get a write-fault before that TLB flush in the parent,
> and we end up replacing that anonymous page in the parent process in
> do_wp_page() (because it is COW-shared with the child), this might lead to
> some stale writable TLB entries targeting the wrong (old) page.
> A similar issue happened in the past with userfaultfd (see the
> flush_tlb_page() call inside do_wp_page()).
> Lock the VMAs of the parent process when forking a child, which prevents
> concurrent page faults during the fork operation and avoids this issue.
> This fix can potentially regress some fork-heavy workloads. Kernel build
> time did not show a noticeable regression on a 56-core machine, while a
> stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> showed a ~5% regression. If such a fork-time regression is unacceptable,
> disabling CONFIG_PER_VMA_LOCK should restore the previous performance.
> Further optimizations are possible if this regression proves to be
> problematic.

Sending this earlier version of the patch per request from Linus and
with his explanation here:
https://lore.kernel.org/all/CAHk-=wi-99-DyMOGywTbjRnRRC+XfpPm=r=pei4A=MEL0QDBXA@mail.gmail.com/

>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
> Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
> Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
> Reported-by: Jacob Young <jacobly.alt@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> Cc: stable@vger.kernel.org
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  kernel/fork.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b85814e614a5..d2e12b6d2b18 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>         for_each_vma(old_vmi, mpnt) {
>                 struct file *file;
>
> +               vma_start_write(mpnt);
>                 if (mpnt->vm_flags & VM_DONTCOPY) {
>                         vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>                         continue;
> --
> 2.41.0.390.g38632f3daf-goog
>
Linus Torvalds July 8, 2023, 9:18 p.m. UTC | #2
On Sat, 8 Jul 2023 at 12:12, Suren Baghdasaryan <surenb@google.com> wrote:
>
>  kernel/fork.c | 1 +
>  1 file changed, 1 insertion(+)

I ended up editing your explanation a lot.

I'm not convinced that the bug has much to do with the delayed tlb flushing.

I think it's more fundamental than some tlb coherence issue: our VM
copying simply expects to not have any unrelated concurrent page fault
activity, and various random internal data structures simply rely on
that.

I made up an example that I'm not sure is relevant to any of the
particular failures, but that I think is a non-TLB case: the parent
'vma->anon_vma' chain is copied by anon_vma_fork() in dup_mmap(), and
it's possible that the parent vma didn't have any anon_vma associated
with it at that point.

But a concurrent page fault to the same vma - even *before* the page
tables have been copied, and when the TLB is still entirely coherent -
could then cause an anon_vma_prepare() on that parent vma, and
associate one of the pages with that anon_vma.

Then the page table copy happens, and that page gets marked read-only
again, and is added to both the parent and the child vma's, but the
child vma never got associated with the parent's new anon_vma, because
it didn't exist when anon_vma_fork() happened.

Does this ever happen? I have no idea. But it would seem to be an
example that really has nothing to do with any TLB state, and is just
simply "we cannot handle concurrent page faults while we're busy
copying the mm".

Again - maybe I messed up, but it really feels like the missing
vma_start_write() was more fundamental, and not some "TLB coherency"
issue.

            Linus
Suren Baghdasaryan July 8, 2023, 10:36 p.m. UTC | #3
On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 8 Jul 2023 at 12:12, Suren Baghdasaryan <surenb@google.com> wrote:
> >
> >  kernel/fork.c | 1 +
> >  1 file changed, 1 insertion(+)
>
> I ended up editing your explanation a lot.
>
> I'm not convinced that the bug has much to do with the delayed tlb flushing.
>
> I think it's more fundamental than some tlb coherence issue: our VM
> copying simply expects to not have any unrelated concurrent page fault
> activity, and various random internal data structures simply rely on
> that.
>
> I made up an example that I'm not sure is relevant to any of the
> particular failures, but that I think is a non-TLB case: the parent
> 'vma->anon_vma' chain is copied by anon_vma_fork() in dup_mmap(), and
> it's possible that the parent vma didn't have any anon_vma associated
> with it at that point.
>
> But a concurrent page fault to the same vma - even *before* the page
> tables have been copied, and when the TLB is still entirely coherent -
> could then cause an anon_vma_prepare() on that parent vma, and
> associate one of the pages with that anon_vma.
>
> Then the page table copy happens, and that page gets marked read-only
> again, and is added to both the parent and the child vma's, but the
> child vma never got associated with the parent's new anon_vma, because
> it didn't exist when anon_vma_fork() happened.
>
> Does this ever happen? I have no idea. But it would seem to be an
> example that really has nothing to do with any TLB state, and is just
> simply "we cannot handle concurrent page faults while we're busy
> copying the mm".
>
> Again - maybe I messed up, but it really feels like the missing
> vma_start_write() was more fundamental, and not some "TLB coherency"
> issue.

Sounds plausible. I'll try to use the reproducer to verify if that's
indeed happening here. It's likely there are multiple problematic
scenarios due to this missing lock though.
Thanks,
Suren.

>
>             Linus
Linus Torvalds July 8, 2023, 10:53 p.m. UTC | #4
On Sat, 8 Jul 2023 at 15:36, Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds
> >
> > Again - maybe I messed up, but it really feels like the missing
> > vma_start_write() was more fundamental, and not some "TLB coherency"
> > issue.
>
> Sounds plausible. I'll try to use the reproducer to verify if that's
> indeed happening here.

I really don't think that's what people are reporting, I was just
trying to make up a completely different case that has nothing to do
with any TLB issues.

My real point was simply this one:

> It's likely there are multiple problematic
> scenarios due to this missing lock though.

Right. That's my issue. I felt your explanation was *too* targeted at
some TLB non-coherency thing, when I think the problem was actually a
much larger "page faults simply must not happen while we're copying
the page tables because data isn't coherent".

The anon_vma case was just meant as another random example of the
other kinds of things I suspect can go wrong, because we're simply not
able to do this whole "copy vma while it's being modified by page
faults".

Now, I agree that the PTE problem is real, and probably the main
thing, ie when we as part of fork() do this:

        /*
         * If it's a COW mapping, write protect it both
         * in the parent and the child
         */
        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }

and the thing that can go wrong before the TLB flush happens is that -
because the TLBs haven't been flushed yet - some threads in the
parent happily continue to write to the page and don't see the
wrprotect happening.

And then you get into the situation where *some* threads see the page
protections change (maybe they had a TLB flush event on that CPU for
random reasons), and they will take a page fault and do the COW thing
and create a new page.

And all the while *other* threads still see the old writeable TLB
state, and continue to write to the old page.

So now you have a page that gets its data copied *while* somebody is
still writing to it, and the end result is that some write easily gets
lost, and so when that new copy is installed, you see it as data
corruption.
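
Schematically (again made up, three CPUs for concreteness):

   CPU0 (fork)              CPU1 (thread)            CPU2 (thread)
   -----------              -------------            -------------
   ptep_set_wrprotect()
                            writes via stale
                            writable TLB entry
                                                     TLB flushed for some
                                                     reason; write faults
                                                     do_wp_page() starts
                                                     copying the old page
                            still writing through
                            the stale entry
                                                     new copy installed;
                                                     CPU1's last writes to
                                                     the old page are lost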

And I agree completely that that is probably the thing that most
people actually saw and reacted to as corruption.

But the reason I didn't like the explanation was that I think this is
just one random example of the more fundamental issue of "we simply
must not take page faults while copying".

Your explanation made me think "stale TLB is the problem", and *that*
was what I objected to. The stale TLB was just one random sign of the
much larger problem.

It might even have been the most common symptom, but I think it was
just a *symptom*, not the *cause* of the problem.

And I must have been bad at explaining that, because David Hildenbrand
also reacted negatively to my change.

So I'll happily take a patch that adds more commentary about this, and
gives several examples of the things that go wrong.

                Linus
Suren Baghdasaryan July 8, 2023, 11:03 p.m. UTC | #5
On Sat, Jul 8, 2023 at 3:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 8 Jul 2023 at 15:36, Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds
> > >
> > > Again - maybe I messed up, but it really feels like the missing
> > > vma_start_write() was more fundamental, and not some "TLB coherency"
> > > issue.
> >
> > Sounds plausible. I'll try to use the reproducer to verify if that's
> > indeed happening here.
>
> I really don't think that's what people are reporting, I was just
> trying to make up a completely different case that has nothing to do
> with any TLB issues.
>
> My real point was simply this one:
>
> > It's likely there are multiple problematic
> > scenarios due to this missing lock though.
>
> Right. That's my issue. I felt your explanation was *too* targeted at
> some TLB non-coherency thing, when I think the problem was actually a
> much larger "page faults simply must not happen while we're copying
> the page tables because data isn't coherent".
>
> The anon_vma case was just meant as another random example of the
> other kinds of things I suspect can go wrong, because we're simply not
> able to do this whole "copy vma while it's being modified by page
> faults".
>
> Now, I agree that the PTE problem is real, and probably the main
> thing, ie when we as part of fork() do this:
>
>         /*
>          * If it's a COW mapping, write protect it both
>          * in the parent and the child
>          */
>         if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>                 ptep_set_wrprotect(src_mm, addr, src_pte);
>                 pte = pte_wrprotect(pte);
>         }
>
> and the thing that can go wrong before the TLB flush happens is that -
> because the TLBs haven't been flushed yet - some threads in the
> parent happily continue to write to the page and don't see the
> wrprotect happening.
>
> And then you get into the situation where *some* threads see the page
> protections change (maybe they had a TLB flush event on that CPU for
> random reasons), and they will take a page fault and do the COW thing
> and create a new page.
>
> And all the while *other* threads still see the old writeable TLB
> state, and continue to write to the old page.
>
> So now you have a page that gets its data copied *while* somebody is
> still writing to it, and the end result is that some write easily gets
> lost, and so when that new copy is installed, you see it as data
> corruption.
>
> And I agree completely that that is probably the thing that most
> people actually saw and reacted to as corruption.
>
> But the reason I didn't like the explanation was that I think this is
> just one random example of the more fundamental issue of "we simply
> must not take page faults while copying".
>
> Your explanation made me think "stale TLB is the problem", and *that*
> was what I objected to. The stale TLB was just one random sign of the
> much larger problem.
>
> It might even have been the most common symptom, but I think it was
> just a *symptom*, not the *cause* of the problem.
>
> And I must have been bad at explaining that, because David Hildenbrand
> also reacted negatively to my change.
>
> So I'll happily take a patch that adds more commentary about this, and
> gives several examples of the things that go wrong.

How about adding your example to the original description as yet
another scenario which is broken without this change? I guess having
both issues described would not hurt.

>
>                 Linus
Mateusz Guzik Aug. 4, 2023, 9:46 p.m. UTC | #6
On Sat, Jul 08, 2023 at 12:12:12PM -0700, Suren Baghdasaryan wrote:
[..]
> Lock the VMAs of the parent process when forking a child, which prevents
> concurrent page faults during the fork operation and avoids this issue.
> This fix can potentially regress some fork-heavy workloads. Kernel build
> time did not show a noticeable regression on a 56-core machine, while a
> stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> showed a ~5% regression. If such a fork-time regression is unacceptable,
> disabling CONFIG_PER_VMA_LOCK should restore the previous performance.
> Further optimizations are possible if this regression proves to be
> problematic.
> 
> ---
>  kernel/fork.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b85814e614a5..d2e12b6d2b18 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  	for_each_vma(old_vmi, mpnt) {
>  		struct file *file;
>  
> +		vma_start_write(mpnt);
>  		if (mpnt->vm_flags & VM_DONTCOPY) {
>  			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>  			continue;
> 

I don't see it mentioned in the discussion, so at a risk of ruffling
feathers or looking really bad I'm going to ask: is the locking of any
use if the forking process is single-threaded? The singular thread in
this case is occupied executing this very code, so it can't do any op
in parallel. Is there anyone else who could trigger a page fault? Are
these shared with other processes? Cursory reading suggests a private
copy is made here, so my guess is no. But then again, I landed here
freshly from the interwebz.

Or in short: if nobody can mess up the state if the forking process is
single-threaded, why not check for mm_users or whatever other indicator
to elide the slowdown for the (arguably) most common case?

If the state can be messed up anyway, that's a shame, but short
explanation how would be welcome.

to illustrate (totally untested):
diff --git a/kernel/fork.c b/kernel/fork.c
index d2e12b6d2b18..aac6b08a0b21 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -652,6 +652,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        LIST_HEAD(uf);
        VMA_ITERATOR(old_vmi, oldmm, 0);
        VMA_ITERATOR(vmi, mm, 0);
+       bool singlethread = atomic_read(&oldmm->mm_users) == 1;

        uprobe_start_dup_mmap();
        if (mmap_write_lock_killable(oldmm)) {
@@ -686,7 +687,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        for_each_vma(old_vmi, mpnt) {
                struct file *file;

-               vma_start_write(mpnt);
+               if (!singlethread)
+                       vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
                        continue;
Linus Torvalds Aug. 4, 2023, 10:49 p.m. UTC | #7
On Fri, 4 Aug 2023 at 14:46, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I don't see it mentioned in the discussion, so at a risk of ruffling
> feathers or looking really bad I'm going to ask: is the locking of any
> use if the forking process is single-threaded?

Sadly, we've always been able to access the mm from other processes,
so the locking is - I think - unavoidable.

And some of those "access from other processes" aren't even uncommon
or special. It's things like "ps" etc, that do it just to see the
process name and arguments.

            Linus
Mateusz Guzik Aug. 4, 2023, 11:25 p.m. UTC | #8
On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 4 Aug 2023 at 14:46, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> I don't see it mentioned in the discussion, so at a risk of ruffling
>> feathers or looking really bad I'm going to ask: is the locking of any
>> use if the forking process is single-threaded?
>
> Sadly, we've always been able to access the mm from other processes,
> so the locking is - I think - unavoidable.
>
> And some of those "access from other processes" aren't even uncommon
> or special. It's things like "ps" etc, that do it just to see the
> process name and arguments.
>

I know of these guys, I think they are excluded as is -- they go
through access_remote_vm, starting with:
        if (mmap_read_lock_killable(mm))
                return 0;

while dup_mmap already write locks the parent's mm.

I don't see any surprise relocks of the semaphore.

Granted, should someone *bypass* this mechanism the above would be moot.
Linus Torvalds Aug. 5, 2023, 12:14 a.m. UTC | #9
On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I know of these guys, I think they are excluded as is -- they go
> through access_remote_vm, starting with:
>         if (mmap_read_lock_killable(mm))
>                 return 0;
>
> while dup_mmap already write locks the parent's mm.

Oh, you're only worried about vma_start_write()?

That's a non-issue. It doesn't take the lock normally, since it starts off with

        if (__is_vma_write_locked(vma, &mm_lock_seq))
                return;

which catches on the lock sequence number already being set.
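
For context, the whole helper at the time was roughly the following (a
sketch of the 6.4-era code, possibly not verbatim):

        static inline void vma_start_write(struct vm_area_struct *vma)
        {
                int mm_lock_seq;

                /* already write-locked under this mmap_lock cycle */
                if (__is_vma_write_locked(vma, &mm_lock_seq))
                        return;

                down_write(&vma->vm_lock->lock);
                vma->vm_lock_seq = mm_lock_seq;
                up_write(&vma->vm_lock->lock);
        }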

So no extra locking there.

Well, technically there's extra locking because the code stupidly
doesn't initialize new vma allocations to the right sequence number,
but that was talked about here:

    https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/

and it's a separate issue.

          Linus
Suren Baghdasaryan Aug. 5, 2023, 12:26 a.m. UTC | #10
On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > I know of these guys, I think they are excluded as is -- they go
> > through access_remote_vm, starting with:
> >         if (mmap_read_lock_killable(mm))
> >                 return 0;
> >
> > while dup_mmap already write locks the parent's mm.
>
> Oh, you're only worried about vma_start_write()?
>
> That's a non-issue. It doesn't take the lock normally, since it starts off with
>
>         if (__is_vma_write_locked(vma, &mm_lock_seq))
>                 return;
>
> which catches on the lock sequence number already being set.

That check will prevent re-locking but if vma is not already locked
then the call will proceed with obtaining the lock and setting
vma->vm_lock_seq to mm->mm_lock_seq.

>
> So no extra locking there.
>
> Well, technically there's extra locking because the code stupidly
> doesn't initialize new vma allocations to the right sequence number,
> but that was talked about here:
>
>     https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
>
> and it's a separate issue.
>
>           Linus
Suren Baghdasaryan Aug. 5, 2023, 12:34 a.m. UTC | #11
On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > I know of these guys, I think they are excluded as is -- they go
> > > through access_remote_vm, starting with:
> > >         if (mmap_read_lock_killable(mm))
> > >                 return 0;
> > >
> > > while dup_mmap already write locks the parent's mm.
> >
> > Oh, you're only worried about vma_start_write()?
> >
> > That's a non-issue. It doesn't take the lock normally, since it starts off with
> >
> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> >                 return;
> >
> > which catches on the lock sequence number already being set.
>
> That check will prevent re-locking but if vma is not already locked
> then the call will proceed with obtaining the lock and setting
> vma->vm_lock_seq to mm->mm_lock_seq.

The optimization Mateusz describes looks valid to me. If there is
nobody else to fault a page and mm_users is stable (which I think it
is because we are holding mmap_lock for write) then we can skip vma
locking, I think.

>
> >
> > So no extra locking there.
> >
> > Well, technically there's extra locking because the code stupidly
> > doesn't initialize new vma allocations to the right sequence number,
> > but that was talked about here:
> >
> >     https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
> >
> > and it's a separate issue.
> >
> >           Linus
Mateusz Guzik Aug. 5, 2023, 12:49 a.m. UTC | #12
On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com>
> wrote:
>>
>> On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> >
>> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > >
>> > > I know of these guys, I think they are excluded as is -- they go
>> > > through access_remote_vm, starting with:
>> > >         if (mmap_read_lock_killable(mm))
>> > >                 return 0;
>> > >
>> > > while dup_mmap already write locks the parent's mm.
>> >
>> > Oh, you're only worried about vma_start_write()?
>> >
>> > That's a non-issue. It doesn't take the lock normally, since it starts
>> > off with
>> >
>> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
>> >                 return;
>> >
>> > which catches on the lock sequence number already being set.
>>
>> That check will prevent re-locking but if vma is not already locked
>> then the call will proceed with obtaining the lock and setting
>> vma->vm_lock_seq to mm->mm_lock_seq.
>
> The optimization Mateusz describes looks valid to me. If there is
> nobody else to fault a page and mm_users is stable (which I think it
> is because we are holding mmap_lock for write) then we can skip vma
> locking, I think.
>

mm_users is definitely *not* stable -- it can be bumped by
get_task_mm, which is only synchronized with task lock.
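
That path is roughly this (paraphrased from kernel/fork.c, treat it as a
sketch):

        struct mm_struct *get_task_mm(struct task_struct *task)
        {
                struct mm_struct *mm;

                task_lock(task);        /* task lock, not mmap_lock */
                mm = task->mm;
                if (mm) {
                        if (task->flags & PF_KTHREAD)
                                mm = NULL;
                        else
                                mmget(mm);      /* bumps mm_users */
                }
                task_unlock(task);
                return mm;
        }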

However, the other users (that I know of) go through the mmap
semaphore to mess with anything which means they will wait for
dup_mmap to finish (or do their work first). I would be surprised if
there were any cases which don't take the semaphore, given that it was
a requirement prior to the vma patchset (unless you patched some to no
longer need it?). I would guess worst case the semaphore can be added
if missing.

What is guaranteed is that if the forking process is single-threaded,
there will be no threads added out of nowhere -- the only thread which
could do it is busy creating one in dup_mmap. If multithreaded
operation of the forking process was the only problem, that's it.

>>
>> >
>> > So no extra locking there.
>> >
>> > Well, technically there's extra locking because the code stupidly
>> > doesn't initialize new vma allocations to the right sequence number,
>> > but that was talked about here:
>> >
>> >
>> > https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
>> >
>> > and it's a separate issue.
>> >
>> >           Linus
>
Mateusz Guzik Aug. 5, 2023, 1:06 a.m. UTC | #13
On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> I know of these guys, I think they are excluded as is -- they go
>> through access_remote_vm, starting with:
>>         if (mmap_read_lock_killable(mm))
>>                 return 0;
>>
>> while dup_mmap already write locks the parent's mm.
>
> Oh, you're only worried about vma_start_write()?
>
> That's a non-issue. It doesn't take the lock normally, since it starts off
> with
>
>         if (__is_vma_write_locked(vma, &mm_lock_seq))
>                 return;
>
> which catches on the lock sequence number already being set.
>
> So no extra locking there.
>
> Well, technically there's extra locking because the code stupidly
> doesn't initialize new vma allocations to the right sequence number,
> but that was talked about here:
>
>
> https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
>
> and it's a separate issue.
>

I'm going to bet one beer this is the issue.

The patch I'm responding to only consists of adding the call to
vma_start_write and claims the 5% slowdown from it, while fixing
crashes if the forking process is multithreaded.

For the fix to work it has to lock something against the parent.

        VMA_ITERATOR(old_vmi, oldmm, 0);
[..]
        for_each_vma(old_vmi, mpnt) {
[..]
                vma_start_write(mpnt);

the added line locks an obj in the parent's vm space.

The problem you linked looks like pessimization for freshly allocated
vmas, but that's what is being operated on here.
Suren Baghdasaryan Aug. 5, 2023, 1:06 a.m. UTC | #14
On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> > On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com>
> > wrote:
> >>
> >> On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds
> >> <torvalds@linux-foundation.org> wrote:
> >> >
> >> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> > >
> >> > > I know of these guys, I think they are excluded as is -- they go
> >> > > through access_remote_vm, starting with:
> >> > >         if (mmap_read_lock_killable(mm))
> >> > >                 return 0;
> >> > >
> >> > > while dup_mmap already write locks the parent's mm.
> >> >
> >> > Oh, you're only worried about vma_start_write()?
> >> >
> >> > That's a non-issue. It doesn't take the lock normally, since it starts
> >> > off with
> >> >
> >> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> >> >                 return;
> >> >
> >> > which catches on the lock sequence number already being set.
> >>
> >> That check will prevent re-locking but if vma is not already locked
> >> then the call will proceed with obtaining the lock and setting
> >> vma->vm_lock_seq to mm->mm_lock_seq.
> >
> > The optimization Mateusz describes looks valid to me. If there is
> > nobody else to fault a page and mm_users is stable (which I think it
> > is because we are holding mmap_lock for write) then we can skip vma
> > locking, I think.
> >
>
> mm_users is definitely *not* stable -- it can be bumped by
> get_task_mm, which is only synchronized with task lock.

Ugh, you are of course correct. Poor choice of words for saying no new
users (threads) can appear from under us.

>
> However, the other users (that I know of) go through the mmap
> semaphore to mess with anything which means they will wait for
> dup_mmap to finish (or do their work first). I would be surprised if
> there were any cases which don't take the semaphore, given that it was
> a requirement prior to the vma patchset (unless you patched some to no
> longer need it?). I would guess worst case the semaphore can be added
> if missing.

No, the only mmap_lock read-lock that is affected is during the page
fault, which is expected.

>
> What is guaranteed is that if the forking process is single-threaded,
> there will be no threads added out of nowhere -- the only thread which
> could do it is busy creating one in dup_mmap. If multithreaded
> operation of the forking process was the only problem, that's it.
>
> >>
> >> >
> >> > So no extra locking there.
> >> >
> >> > Well, technically there's extra locking because the code stupidly
> >> > doesn't initialize new vma allocations to the right sequence number,
> >> > but that was talked about here:
> >> >
> >> >
> >> > https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
> >> >
> >> > and it's a separate issue.
> >> >
> >> >           Linus
> >
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik Aug. 5, 2023, 1:16 a.m. UTC | #15
On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>> However, the other users (that I know of) go through the mmap
>> semaphore to mess with anything which means they will wait for
>> dup_mmap to finish (or do their work first). I would be surprised if
>> there were any cases which don't take the semaphore, given that it was
>> a requirement prior to the vma patchset (unless you patched some to no
>> longer need it?). I would guess worst case the semaphore can be added
>> if missing.
>
> No, the only mmap_lock read-lock that is affected is during the page
> fault, which is expected.
>

I have difficulty parsing your statement.

I am saying that any 3rd parties which can trigger page faults already
read lock mmap_lock or can be made to do it (and I don't know any case
which does not already, but I'm not willing to spend time poking
around to make sure). One can consider 3rd parties as not a problem,
modulo the audit.

Past that, there is no known source of trouble? In my original
e-mail I was worried about processes up the chain in ancestry, perhaps
some of the state is shared(?) and the locking at hand neuters any
problems. I'm guessing this is not necessary.

Bottom line though it looks like this will work fine?

That said, I'm not going to submit a patch I can't confidently defend.
As I did not dig into any of the VMA code and can't be arsed to audit
all places which mess with "foreign" mm, I'm definitely not submitting
this myself. You are most welcome to write your own variant at your
leisure. :)
Suren Baghdasaryan Aug. 5, 2023, 1:36 a.m. UTC | #16
On Fri, Aug 4, 2023 at 6:17 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> > On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> However, the other users (that I know of) go through the mmap
> >> semaphore to mess with anything which means they will wait for
> >> dup_mmap to finish (or do their work first). I would be surprised if
> >> there were any cases which don't take the semaphore, given that it was
> >> a requirement prior to the vma patchset (unless you patched some to no
> >> longer need it?). I would guess worst case the semaphore can be added
> >> if missing.
> >
> > No, the only mmap_lock read-lock that is affected is during the page
> > fault, which is expected.
> >
>
> I have difficulty parsing your statement.

I was just saying that the vma lock patchset did not touch any other
mmap_lock paths except for the page fault one, where we try to skip
read-locking mmap_lock.

>
> I am saying that any 3rd parties which can trigger page faults already
> read lock mmap_lock or can be made to do it (and I don't know any case
> which does not already, but I'm not willing to spend time poking
> around to make sure). One can consider 3rd parties as not a problem,
> modulo the audit.
>
> Past that, there is no known source of trouble? In my original
> e-mail I was worried about processes up the chain in ancestry, perhaps
> some of the state is shared(?) and the locking at hand neuters any
> problems. I'm guessing this is not necessary.
>
> Bottom line though it looks like this will work fine?
>
> That said, I'm not going to submit a patch I can't confidently defend.
> As I did not dig into any of the VMA code and can't be arsed to audit
> all places which mess with "foreign" mm, I'm definitely not submitting
> this myself. You are most welcome to write your own variant at your
> leisure. :)

Ok, I see. I'll need to double check locking when a 3rd party is
involved. Will post a patch when I'm confident enough it's safe.
Thanks!

>
> --
> Mateusz Guzik <mjguzik gmail.com>
Suren Baghdasaryan Aug. 5, 2023, 1:42 a.m. UTC | #17
On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>
> >> I know of these guys, I think they are excluded as is -- they go
> >> through access_remote_vm, starting with:
> >>         if (mmap_read_lock_killable(mm))
> >>                 return 0;
> >>
> >> while dup_mmap already write locks the parent's mm.
> >
> > Oh, you're only worried about vma_start_write()?
> >
> > That's a non-issue. It doesn't take the lock normally, since it starts off
> > with
> >
> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> >                 return;
> >
> > which catches on the lock sequence number already being set.
> >
> > So no extra locking there.
> >
> > Well, technically there's extra locking because the code stupidly
> > doesn't initialize new vma allocations to the right sequence number,
> > but that was talked about here:
> >
> >
> > https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
> >
> > and it's a separate issue.
> >
>
> I'm going to bet one beer this is the issue.
>
> The patch I'm responding to only consists of adding the call to
> vma_start_write and claims the 5% slowdown from it, while fixing
> crashes if the forking process is multithreaded.
>
> For the fix to work it has to lock something against the parent.
>
>         VMA_ITERATOR(old_vmi, oldmm, 0);
> [..]
>         for_each_vma(old_vmi, mpnt) {
> [..]
>                 vma_start_write(mpnt);
>
> the added line locks an obj in the parent's vm space.
>
> The problem you linked looks like pessimization for freshly allocated
> vmas, but that's what is being operated on here.

Sorry, now I'm having trouble understanding the problem you are
describing. We are locking the parent's vma before copying it and the
newly created vma is locked before it's added into the vma tree. What
is the problem then?

>
> --
> Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik Aug. 9, 2023, 9:07 p.m. UTC | #18
On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >>
>> >> I know of these guys, I think they are excluded as is -- they go
>> >> through access_remote_vm, starting with:
>> >>         if (mmap_read_lock_killable(mm))
>> >>                 return 0;
>> >>
>> >> while dup_mmap already write locks the parent's mm.
>> >
>> > Oh, you're only worried about vma_start_write()?
>> >
>> > That's a non-issue. It doesn't take the lock normally, since it starts
>> > off
>> > with
>> >
>> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
>> >                 return;
>> >
>> > which catches on the lock sequence number already being set.
>> >
>> > So no extra locking there.
>> >
>> > Well, technically there's extra locking because the code stupidly
>> > doesn't initialize new vma allocations to the right sequence number,
>> > but that was talked about here:
>> >
>> >
>> > https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
>> >
>> > and it's a separate issue.
>> >
>>
>> I'm going to bet one beer this is the issue.
>>
>> The patch I'm responding to only consists of adding the call to
>> vma_start_write and claims the 5% slowdown from it, while fixing
>> crashes if the forking process is multithreaded.
>>
>> For the fix to work it has to lock something against the parent.
>>
>>         VMA_ITERATOR(old_vmi, oldmm, 0);
>> [..]
>>         for_each_vma(old_vmi, mpnt) {
>> [..]
>>                 vma_start_write(mpnt);
>>
>> the added line locks an obj in the parent's vm space.
>>
>> The problem you linked looks like pessimization for freshly allocated
>> vmas, but that's what is being operated on here.
>
> Sorry, now I'm having trouble understanding the problem you are
> describing. We are locking the parent's vma before copying it and the
> newly created vma is locked before it's added into the vma tree. What
> is the problem then?
>

Sorry for the late reply!

Looks like there has been a bunch of weird talking past one another in this
thread and I don't think trying to straighten it all out is worth any
time.

I think at least the two of us agree that if a single-threaded process
enters dup_mmap and down_writes the mmap semaphore, then no new thread
can pop up in said process, thus no surprise page faults from that
angle. 3rd parties are supposed to use interfaces like access_remote_vm,
which down_read said semaphore and are consequently also not a problem.
The only worry here is that someone is messing with another process's
memory without the semaphore, but that is very unlikely and patchable in
the worst case -- but someone(tm) has to audit. With all these
conditions satisfied one can elide vma_start_write for a perf win.

Finally, I think we agreed you are going to do the audit ;)

Cheers,
Suren Baghdasaryan Aug. 10, 2023, 8:31 p.m. UTC | #19
On Wed, Aug 9, 2023 at 2:07 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> > On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >>
> >> On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> >>
> >> >> I know of these guys, I think they are excluded as is -- they go
> >> >> through access_remote_vm, starting with:
> >> >>         if (mmap_read_lock_killable(mm))
> >> >>                 return 0;
> >> >>
> >> >> while dup_mmap already write locks the parent's mm.
> >> >
> >> > Oh, you're only worried about vma_start_write()?
> >> >
> >> > That's a non-issue. It doesn't take the lock normally, since it starts
> >> > off
> >> > with
> >> >
> >> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> >> >                 return;
> >> >
> >> > which catches on the lock sequence number already being set.
> >> >
> >> > So no extra locking there.
> >> >
> >> > Well, technically there's extra locking because the code stupidly
> >> > doesn't initialize new vma allocations to the right sequence number,
> >> > but that was talked about here:
> >> >
> >> >
> >> > https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
> >> >
> >> > and it's a separate issue.
> >> >
> >>
> >> I'm going to bet one beer this is the issue.
> >>
> >> The patch I'm responding to only consists of adding the call to
> >> vma_start_write and claims the 5% slowdown from it, while fixing
> >> crashes if the forking process is multithreaded.
> >>
> >> For the fix to work it has to lock something against the parent.
> >>
> >>         VMA_ITERATOR(old_vmi, oldmm, 0);
> >> [..]
> >>         for_each_vma(old_vmi, mpnt) {
> >> [..]
> >>                 vma_start_write(mpnt);
> >>
> >> the added line locks an obj in the parent's vm space.
> >>
> >> The problem you linked looks like pessimization for freshly allocated
> >> vmas, but that's what is being operated on here.
> >
> > Sorry, now I'm having trouble understanding the problem you are
> > describing. We are locking the parent's vma before copying it and the
> > newly created vma is locked before it's added into the vma tree. What
> > is the problem then?
> >
>
> Sorry for the late reply!
>
> Looks like there has been a bunch of weird talking past one another in this
> thread and I don't think trying to straighten it all out is worth any
> time.
>
> I think at least the two of us agree that if a single-threaded process
> enters dup_mmap and down_writes the mmap semaphore, then no new thread
> can pop up in said process, thus no surprise page faults from that
> angle. 3rd parties are supposed to use interfaces like access_remote_vm,
> which down_read said semaphore and are consequently also not a problem.
> The only worry here is that someone is messing with another process's
> memory without the semaphore, but that is very unlikely and patchable in
> the worst case -- but someone(tm) has to audit. With all these
> conditions satisfied one can elide vma_start_write for a perf win.
>
> Finally, I think we agreed you are going to do the audit ;)

Ack. I'll look into this once the dust settles. Thanks!

>
> Cheers,
> --
> Mateusz Guzik <mjguzik gmail.com>

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..d2e12b6d2b18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;
 
+		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;