Message ID: 20230708191212.4147700-3-surenb@google.com (mailing list archive)
State: New
Series: [v2,1/3] mm: lock a vma before stack expansion
On Sat, Jul 8, 2023 at 12:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> When forking a child process, parent write-protects an anonymous page
> and COW-shares it with the child being forked using copy_present_pte().
> Parent's TLB is flushed right before we drop the parent's mmap_lock in
> dup_mmap(). If we get a write-fault before that TLB flush in the parent,
> and we end up replacing that anonymous page in the parent process in
> do_wp_page() (because, COW-shared with the child), this might lead to
> some stale writable TLB entries targeting the wrong (old) page.
> Similar issue happened in the past with userfaultfd (see flush_tlb_page()
> call inside do_wp_page()).
>
> Lock VMAs of the parent process when forking a child, which prevents
> concurrent page faults during fork operation and avoids this issue.
>
> This fix can potentially regress some fork-heavy workloads. Kernel build
> time did not show noticeable regression on a 56-core machine while a
> stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> shows ~5% regression. If such fork time regression is unacceptable,
> disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
> optimizations are possible if this regression proves to be problematic.
Sending this earlier version of the patch per request from Linus and with
his explanation here:
https://lore.kernel.org/all/CAHk-=wi-99-DyMOGywTbjRnRRC+XfpPm=r=pei4A=MEL0QDBXA@mail.gmail.com/

> Suggested-by: David Hildenbrand <david@redhat.com>
> Reported-by: Jiri Slaby <jirislaby@kernel.org>
> Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
> Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
> Reported-by: Jacob Young <jacobly.alt@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> Cc: stable@vger.kernel.org
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  kernel/fork.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b85814e614a5..d2e12b6d2b18 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> 	for_each_vma(old_vmi, mpnt) {
> 		struct file *file;
>
> +		vma_start_write(mpnt);
> 		if (mpnt->vm_flags & VM_DONTCOPY) {
> 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> 			continue;
> --
> 2.41.0.390.g38632f3daf-goog
On Sat, 8 Jul 2023 at 12:12, Suren Baghdasaryan <surenb@google.com> wrote:
>
>  kernel/fork.c | 1 +
>  1 file changed, 1 insertion(+)

I ended up editing your explanation a lot.

I'm not convinced that the bug has much to do with the delayed tlb flushing.

I think it's more fundamental than some tlb coherence issue: our VM
copying simply expects not to have any unrelated concurrent page fault
activity, and various random internal data structures simply rely on that.

I made up an example that I'm not sure is relevant to any of the
particular failures, but that I think is a non-TLB case: the parent
'vma->anon_vma' chain is copied by dup_mmap() in anon_vma_fork(), and
it's possible that the parent vma didn't have any anon_vma associated
with it at that point.

But a concurrent page fault to the same vma - even *before* the page
tables have been copied, and when the TLB is still entirely coherent -
could then cause an anon_vma_prepare() on that parent vma, and
associate one of the pages with that anon_vma.

Then the page table copy happens, and that page gets marked read-only
again, and is added to both the parent's and the child's vmas, but the
child vma never got associated with the parent's new anon_vma, because
it didn't exist when anon_vma_fork() happened.

Does this ever happen? I have no idea. But it would seem to be an
example that really has nothing to do with any TLB state, and is just
simply "we cannot handle concurrent page faults while we're busy
copying the mm".

Again - maybe I messed up, but it really feels like the missing
vma_start_write() was more fundamental, and not some "TLB coherency"
issue.

Linus
On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [..]
> Again - maybe I messed up, but it really feels like the missing
> vma_start_write() was more fundamental, and not some "TLB coherency"
> issue.

Sounds plausible. I'll try to use the reproducer to verify if that's
indeed happening here. It's likely there are multiple problematic
scenarios due to this missing lock though.

Thanks,
Suren.

> Linus
On Sat, 8 Jul 2023 at 15:36, Suren Baghdasaryan <surenb@google.com> wrote:
> [..]
> Sounds plausible. I'll try to use the reproducer to verify if that's
> indeed happening here.

I really don't think that's what people are reporting, I was just
trying to make up a completely different case that has nothing to do
with any TLB issues.

My real point was simply this one:

> It's likely there are multiple problematic
> scenarios due to this missing lock though.

Right. That's my issue. I felt your explanation was *too* targeted at
some TLB non-coherency thing, when I think the problem was actually a
much larger "page faults simply must not happen while we're copying
the page tables because data isn't coherent".

The anon_vma case was just meant as another random example of the
other kinds of things I suspect can go wrong, because we're simply not
able to do this whole "copy vma while it's being modified by page
faults".

Now, I agree that the PTE problem is real, and probably the main
thing, ie when we as part of fork() do this:

        /*
         * If it's a COW mapping, write protect it both
         * in the parent and the child
         */
        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }

and the thing that can go wrong before the TLB flush happens is that -
because the TLBs haven't been flushed yet - some threads in the parent
happily continue to write to the page and didn't see the wrprotect
happening.

And then you get into the situation where *some* threads see the page
protections change (maybe they had a TLB flush event on that CPU for
random reasons), and they will take a page fault and do the COW thing
and create a new page.

And all the while *other* threads still see the old writeable TLB
state, and continue to write to the old page.

So now you have a page that gets its data copied *while* somebody is
still writing to it, and the end result is that some write easily gets
lost, and so when that new copy is installed, you see it as data
corruption.

And I agree completely that that is probably the thing that most
people actually saw and reacted to as corruption.

But the reason I didn't like the explanation was that I think this is
just one random example of the more fundamental issue of "we simply
must not take page faults while copying".

Your explanation made me think "stale TLB is the problem", and *that*
was what I objected to. The stale TLB was just one random sign of the
much larger problem.

It might even have been the most common symptom, but I think it was
just a *symptom*, not the *cause* of the problem.

And I must have been bad at explaining that, because David Hildenbrand
also reacted negatively to my change.

So I'll happily take a patch that adds more commentary about this, and
gives several examples of the things that go wrong.

Linus
On Sat, Jul 8, 2023 at 3:54 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [..]
> And I must have been bad at explaining that, because David Hildenbrand
> also reacted negatively to my change.
>
> So I'll happily take a patch that adds more commentary about this, and
> gives several examples of the things that go wrong.

How about adding your example to the original description as yet
another scenario which is broken without this change? I guess having
both issues described would not hurt.

> Linus
On Sat, Jul 08, 2023 at 12:12:12PM -0700, Suren Baghdasaryan wrote:
[..]
> Lock VMAs of the parent process when forking a child, which prevents
> concurrent page faults during fork operation and avoids this issue.
> This fix can potentially regress some fork-heavy workloads. Kernel build
> time did not show noticeable regression on a 56-core machine while a
> stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> shows ~5% regression. If such fork time regression is unacceptable,
> disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
> optimizations are possible if this regression proves to be problematic.
> [..]
> +		vma_start_write(mpnt);
> [..]

I don't see it mentioned in the discussion, so at the risk of ruffling
feathers or looking really bad I'm going to ask: is the locking of any
use if the forking process is single-threaded? The singular thread in
this case is occupied executing this very code, so it can't do any op
in parallel.

Is there anyone else who could trigger a page fault? Are these shared
with other processes? Cursory reading suggests a private copy is made
here, so my guess is no. But then again, I landed here freshly from
the interwebz.

Or in short: if nobody can mess up the state if the forking process is
single-threaded, why not check mm_users or whatever other indicator to
elide the slowdown for the (arguably) most common case?

If the state can be messed up anyway, that's a shame, but a short
explanation how would be welcome.

To illustrate (totally untested):

diff --git a/kernel/fork.c b/kernel/fork.c
index d2e12b6d2b18..aac6b08a0b21 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -652,6 +652,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	LIST_HEAD(uf);
 	VMA_ITERATOR(old_vmi, oldmm, 0);
 	VMA_ITERATOR(vmi, mm, 0);
+	bool singlethread = READ_ONCE(oldmm->mm_users) == 1;

 	uprobe_start_dup_mmap();
 	if (mmap_write_lock_killable(oldmm)) {
@@ -686,7 +687,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;

-		vma_start_write(mpnt);
+		if (!singlethread)
+			vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
On Fri, 4 Aug 2023 at 14:46, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I don't see it mentioned in the discussion, so at the risk of ruffling
> feathers or looking really bad I'm going to ask: is the locking of any
> use if the forking process is single-threaded?

Sadly, we've always been able to access the mm from other processes,
so the locking is - I think - unavoidable.

And some of those "access from other processes" aren't even uncommon
or special. It's things like "ps" etc, that do it just to see the
process name and arguments.

Linus
On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [..]
> Sadly, we've always been able to access the mm from other processes,
> so the locking is - I think - unavoidable.
>
> And some of those "access from other processes" aren't even uncommon
> or special. It's things like "ps" etc, that do it just to see the
> process name and arguments.

I know of these guys, I think they are excluded as is -- they go
through access_remote_vm, starting with:

	if (mmap_read_lock_killable(mm))
		return 0;

while dup_mmap already write locks the parent's mm. I don't see any
surprise relocks of the semaphore.

Granted, should someone *bypass* this mechanism the above would be moot.
On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I know of these guys, I think they are excluded as is -- they go
> through access_remote_vm, starting with:
>
>	if (mmap_read_lock_killable(mm))
>		return 0;
>
> while dup_mmap already write locks the parent's mm.

Oh, you're only worried about vma_start_write()?

That's a non-issue. It doesn't take the lock normally, since it starts off with

	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;

which catches on the lock sequence number already being set.

So no extra locking there.

Well, technically there's extra locking because the code stupidly
doesn't initialize new vma allocations to the right sequence number,
but that was talked about here:

    https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/

and it's a separate issue.

Linus
On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [..]
> That's a non-issue. It doesn't take the lock normally, since it starts off with
>
>	if (__is_vma_write_locked(vma, &mm_lock_seq))
>		return;
>
> which catches on the lock sequence number already being set.

That check will prevent re-locking, but if the vma is not already
locked then the call will proceed with obtaining the lock and setting
vma->vm_lock_seq to mm->mm_lock_seq.
On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
> [..]
> That check will prevent re-locking but if vma is not already locked
> then the call will proceed with obtaining the lock and setting
> vma->vm_lock_seq to mm->mm_lock_seq.

The optimization Mateusz describes looks valid to me. If there is
nobody else to fault a page and mm_users is stable (which I think it
is because we are holding mmap_lock for write) then we can skip vma
locking, I think.
On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> [..]
> The optimization Mateusz describes looks valid to me. If there is
> nobody else to fault a page and mm_users is stable (which I think it
> is because we are holding mmap_lock for write) then we can skip vma
> locking, I think.

mm_users is definitely *not* stable -- it can be bumped by
get_task_mm, which is only synchronized with the task lock.

However, the other users (that I know of) go through the mmap
semaphore to mess with anything, which means they will wait for
dup_mmap to finish (or do their work first). I would be surprised if
there were any cases which don't take the semaphore, given that it was
a requirement prior to the vma patchset (unless you patched some to no
longer need it?). I would guess worst case the semaphore can be added
if missing.

What is guaranteed is that if the forking process is single-threaded,
there will be no threads added out of nowhere -- the only thread which
could do it is busy creating one in dup_mmap. If multithreaded
operation of the forking process was the only problem, that's it.
On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> [..]
> Well, technically there's extra locking because the code stupidly
> doesn't initialize new vma allocations to the right sequence number,
> but that was talked about here:
>
> https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/
>
> and it's a separate issue.

I'm going to bet one beer this is the issue.

The patch I'm responding to only consists of adding the call to
vma_start_write and claims the 5% slowdown from it, while fixing
crashes if the forking process is multithreaded.

For the fix to work it has to lock something against the parent.

	VMA_ITERATOR(old_vmi, oldmm, 0);
	[..]
	for_each_vma(old_vmi, mpnt) {
	[..]
		vma_start_write(mpnt);

the added line locks an obj in the parent's vm space.

The problem you linked looks like a pessimization for freshly
allocated vmas, but that's what is being operated on here.
On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> [..]
> mm_users is definitely *not* stable -- it can be bumped by
> get_task_mm, which is only synchronized with task lock.

Ugh, you are of course correct. Poor choice for saying no new users
(threads) can appear from under us.

> However, the other users (that I know of) go through the mmap
> semaphore to mess with anything which means they will wait for
> dup_mmap to finish (or do their work first). I would be surprised if
> there were any cases which don't take the semaphore, given that it was
> a requirement prior to the vma patchset (unless you patched some to no
> longer need it?). I would guess worst case the semaphore can be added
> if missing.

No, the only mmap_lock read-lock that is affected is during the page
fault, which is expected.
On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote:
> [..]
> No, the only mmap_lock read-lock that is affected is during the page
> fault, which is expected.

I have difficulty parsing your statement.

I am saying that any 3rd parties which can trigger page faults already
read lock mmap_lock or can be made to do it (and I don't know any case
which does not already, but I'm not willing to spend time poking
around to make sure). One can consider 3rd parties as not a problem,
modulo the audit.

Past that, is there no known source of trouble? In my original e-mail
I was worried about processes up the chain in ancestry, perhaps some
of the state is shared(?) and the locking at hand neuters any
problems. I'm guessing this is not necessary.

Bottom line though, it looks like this will work fine?

That said, I'm not going to submit a patch I can't confidently defend.
As I did not dig into any of the VMA code and can't be arsed to audit
all places which mess with "foreign" mm, I'm definitely not submitting
this myself. You are most welcome to write your own variant at your
leisure. :)
On Fri, Aug 4, 2023 at 6:17 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> [..]
> I have difficulty parsing your statement.

I was just saying that the vma lock patchset did not touch any other
mmap_locking paths except for the page fault one, where we try to skip
read-locking mmap_lock.

> [..]
> That said, I'm not going to submit a patch I can't confidently defend.
> As I did not dig into any of the VMA code and can't be arsed to audit
> all places which mess with "foreign" mm, I'm definitely not submitting
> this myself. You are most welcome to write your own variant at your
> leisure. :)

Ok, I see. I'll need to double check locking when a 3rd party is
involved. Will post a patch when I'm confident enough it's safe.
Thanks!
On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> [..]
> I'm going to bet one beer this is the issue.
>
> The patch I'm responding to only consists of adding the call to
> vma_start_write and claims the 5% slowdown from it, while fixing
> crashes if the forking process is multithreaded.
>
> For the fix to work it has to lock something against the parent.
>
>	VMA_ITERATOR(old_vmi, oldmm, 0);
>	[..]
>	for_each_vma(old_vmi, mpnt) {
>	[..]
>		vma_start_write(mpnt);
>
> the added line locks an obj in the parent's vm space.
>
> The problem you linked looks like pessimization for freshly allocated
> vmas, but that's what is being operated on here.

Sorry, now I'm having trouble understanding the problem you are
describing. We are locking the parent's vma before copying it, and the
newly created vma is locked before it's added into the vma tree. What
is the problem then?
On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote: > On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> >> >> I know of these guys, I think they are excluded as is -- they go >> >> through access_remote_vm, starting with: >> >> if (mmap_read_lock_killable(mm)) >> >> return 0; >> >> >> >> while dup_mmap already write locks the parent's mm. >> > >> > Oh, you're only worried about vma_start_write()? >> > >> > That's a non-issue. It doesn't take the lock normally, since it starts >> > off >> > with >> > >> > if (__is_vma_write_locked(vma, &mm_lock_seq)) >> > return; >> > >> > which catches on the lock sequence number already being set. >> > >> > So no extra locking there. >> > >> > Well, technically there's extra locking because the code stupidly >> > doesn't initialize new vma allocations to the right sequence number, >> > but that was talked about here: >> > >> > >> > https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/ >> > >> > and it's a separate issue. >> > >> >> I'm going to bet one beer this is the issue. >> >> The patch I'm responding to only consists of adding the call to >> vma_start_write and claims the 5% slowdown from it, while fixing >> crashes if the forking process is multithreaded. >> >> For the fix to work it has to lock something against the parent. >> >> VMA_ITERATOR(old_vmi, oldmm, 0); >> [..] >> for_each_vma(old_vmi, mpnt) { >> [..] >> vma_start_write(mpnt); >> >> the added line locks an obj in the parent's vm space. >> >> The problem you linked looks like pessimization for freshly allocated >> vmas, but that's what is being operated on here. > > Sorry, now I'm having trouble understanding the problem you are > describing. 
We are locking the parent's vma before copying it and the > newly created vma is locked before it's added into the vma tree. What > is the problem then? > Sorry for the late reply! Looks like there has been a bunch of weird talking past one another in this thread and I don't think trying to straighten it all out is worth any time. I think at least the two of us agree that if a single-threaded process enters dup_mmap and down_writes the mmap semaphore, then no new thread can pop up in said process, thus no surprise page faults from that angle. 3rd parties are supposed to use interfaces like access_remote_vm, which down_read said semaphore and are consequently also not a problem. The only worry here is that someone is messing with another process's memory without the semaphore, but that is very unlikely and patchable in the worst case -- but someone(tm) has to audit. With all these conditions satisfied one can elide vma_start_write for a perf win. Finally, I think we agreed you are going to do the audit ;) Cheers,
On Wed, Aug 9, 2023 at 2:07 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On 8/5/23, Suren Baghdasaryan <surenb@google.com> wrote: > > On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > >> > >> On 8/5/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik <mjguzik@gmail.com> wrote: > >> >> > >> >> I know of these guys, I think they are excluded as is -- they go > >> >> through access_remote_vm, starting with: > >> >> if (mmap_read_lock_killable(mm)) > >> >> return 0; > >> >> > >> >> while dup_mmap already write locks the parent's mm. > >> > > >> > Oh, you're only worried about vma_start_write()? > >> > > >> > That's a non-issue. It doesn't take the lock normally, since it starts > >> > off > >> > with > >> > > >> > if (__is_vma_write_locked(vma, &mm_lock_seq)) > >> > return; > >> > > >> > which catches on the lock sequence number already being set. > >> > > >> > So no extra locking there. > >> > > >> > Well, technically there's extra locking because the code stupidly > >> > doesn't initialize new vma allocations to the right sequence number, > >> > but that was talked about here: > >> > > >> > > >> > https://lore.kernel.org/all/CAHk-=wiCrWAoEesBuoGoqqufvesicbGp3cX0LyKgEvsFaZNpDA@mail.gmail.com/ > >> > > >> > and it's a separate issue. > >> > > >> > >> I'm going to bet one beer this is the issue. > >> > >> The patch I'm responding to only consists of adding the call to > >> vma_start_write and claims the 5% slowdown from it, while fixing > >> crashes if the forking process is multithreaded. > >> > >> For the fix to work it has to lock something against the parent. > >> > >> VMA_ITERATOR(old_vmi, oldmm, 0); > >> [..] > >> for_each_vma(old_vmi, mpnt) { > >> [..] > >> vma_start_write(mpnt); > >> > >> the added line locks an obj in the parent's vm space. > >> > >> The problem you linked looks like pessimization for freshly allocated > >> vmas, but that's what is being operated on here. 
> > > > Sorry, now I'm having trouble understanding the problem you are > > describing. We are locking the parent's vma before copying it and the > > newly created vma is locked before it's added into the vma tree. What > > is the problem then? > > > > Sorry for the late reply! > > Looks there has been a bunch of weird talking past one another in this > thread and I don't think trying to straighten it all out is worth any > time. > > I think at least the two of us agree that if a single-threaded process > enters dup_mmap an > down_writes the mmap semaphore, then no new thread can pop up in said > process, thus no surprise page faults from that angle. 3rd parties are > supposed to interfaces like access_remote_vm, which down_read said > semaphore and are consequently also not a problem. The only worry here > is that someone is messing with another process memory without the > semaphore, but is very unlikely and patchable in the worst case -- but > someone(tm) has to audit. With all these conditions satisfied one can > elide vma_start_write for a perf win. > > Finally, I think we agreed you are going to do the audit ;) Ack. I'll look into this once the dust settles. Thanks! > > Cheers, > -- > Mateusz Guzik <mjguzik gmail.com>
diff --git a/kernel/fork.c b/kernel/fork.c
index b85814e614a5..d2e12b6d2b18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;
 
+		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
When forking a child process, the parent write-protects an anonymous page and COW-shares it with the child being forked using copy_present_pte(). The parent's TLB is flushed right before we drop the parent's mmap_lock in dup_mmap(). If we get a write-fault before that TLB flush in the parent, and we end up replacing that anonymous page in the parent process in do_wp_page() (because it is COW-shared with the child), this might lead to some stale writable TLB entries targeting the wrong (old) page.

A similar issue happened in the past with userfaultfd (see the flush_tlb_page() call inside do_wp_page()).

Lock the VMAs of the parent process when forking a child, which prevents concurrent page faults during the fork operation and avoids this issue.

This fix can potentially regress some fork-heavy workloads. Kernel build time did not show a noticeable regression on a 56-core machine, while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows a ~5% regression. If such a fork time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic.

Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)