Message ID | CAGudoHFdPd5a7fiAutTFAxQz5f1fP2n9rwORm7Hj9fPzuVhiKw@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Series | not issuing vma_start_write() in dup_mmap() if the caller is single-threaded |
Hi Mateusz, On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Hi Suren, > > I brought this up long time ago, you agreed the idea works and stated > you will sort it out, but it apparently fell to the wayside (no hard > feelings though :>) Sorry about that. Yeah, that was a hectic time and I completely forgot to look into the single-user case. > > Here is the original: > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > the thread is weirdly long and I recommend not opening it without a > good reason, I link it for reference if needed. I had to re-read it to remember what it was all about :) To bring others up-to-speed, the suggested change would look something like this: @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, unsigned long charge = 0; LIST_HEAD(uf); VMA_ITERATOR(vmi, mm, 0); + bool only_user; if (mmap_write_lock_killable(oldmm)) return -EINTR; + only_user = atomic_read(&oldmm->mm_users) == 1; flush_cache_dup_mm(oldmm); uprobe_dup_mmap(oldmm, mm); /* @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, mt_clear_in_rcu(vmi.mas.tree); for_each_vma(vmi, mpnt) { struct file *file; - vma_start_write(mpnt); + if (!only_user) + vma_start_write(mpnt); if (mpnt->vm_flags & VM_DONTCOPY) { retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, mpnt->vm_end, GFP_KERNEL); > > At the time there were woes concerning stability of the new locking > scheme, resulting in CC list being rather excessive. As such I'm > starting a new thread with a modest list, not sure who to add though. > > So dup_mmap() holds the caller mmap_sem for writing and calls > vma_start_write() to protect against fault handling in another threads > using the same mm. > > If this is the only thread with this ->mm in place, there are no > sibling threads to worry about and this can be checked with mm_users > == 1. > > AFAICS all remote accesses require the mmap_sem to also be held, which > provides exclusion against dup_mmap, meaning they don't pose a problem > either. Yes, I believe the optimization you proposed would be safe. > > The patch to merely skip locking is a few liner and I would officially > submit it myself, but for my taste an assert is needed in fault > handling to runtime test the above invariant. Hmm. Adding an assert in the pagefault path that checks whether the fault is triggered during dup_mmap() && mm_users==1 would not be trivial. We would need to indicate that we expect no page faults while in that section of dup_mmap() (maybe a flag inside mm_struct) and then assert that this flag is not set inside the pagefault handling path. I'm not sure it's worth the complexity... As was discussed in that thread, the only other task that might fault a page would be external and therefore would have to go through access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() already holds mmap_write_lock the new user would have to wait. > So happens I really > can't be bothered to figure out how to sort it out and was hoping you > would step in. ;) Alternatively if you guys don't think the assert is > warranted, that's your business. I don't think it's worth it but I'll CC Matthew and Lorenzo (you already CC'ed Liam) to get their opinion. > > As for whether this can save any locking -- yes: Yeah, I'm sure it will make a difference in performance. While forking we are currently locking each VMA separately, so skipping that would be nice. Folks, WDYT? 
Do we need a separate assertion that pagefault can't happen if mm_users==1 and we are holding mmap_write_lock (access_remote_vm() will block)? Thanks, Suren. > > I added a probe (below for reference) with two args: whether we are > single-threaded and vma_start_write() returning whether it took the > down/up cycle and ran make -j 20 in the kernel dir. > > The lock was taken for every single vma (377913 in total), while all > forking processes were single-threaded. Or to put it differently all > of these were skippable. > > the probe (total hack): > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > probe diff: > diff --git a/fs/namei.c b/fs/namei.c > index ecb7b95c2ca3..d6cde76eda81 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -5459,3 +5459,7 @@ const struct inode_operations > page_symlink_inode_operations = { > .get_link = page_get_link, > }; > EXPORT_SYMBOL(page_symlink_inode_operations); > + > +void dup_probe(int, int); > +void dup_probe(int, int) { } > +EXPORT_SYMBOL(dup_probe); > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 1f80baddacc5..f7b1f0a02f2e 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > vm_area_struct *vma, unsigned int *mm_l > * Exclude concurrent readers under the per-VMA lock until the currently > * write-locked mmap_lock is dropped or downgraded. > */ > -static inline void vma_start_write(struct vm_area_struct *vma) > +static inline bool vma_start_write(struct vm_area_struct *vma) > { > unsigned int mm_lock_seq; > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > - return; > + return false; > > down_write(&vma->vm_lock->lock); > /* > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > vm_area_struct *vma) > */ > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > up_write(&vma->vm_lock->lock); > + return true; > } > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > diff --git a/kernel/fork.c b/kernel/fork.c > index 735405a9c5f3..0cc56255a339 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > struct mm_struct *oldmm) > pr_warn_once("exe_file_deny_write_access() failed in > %s\n", __func__); > } > > +void dup_probe(int, int); > + > #ifdef CONFIG_MMU > static __latent_entropy int dup_mmap(struct mm_struct *mm, > struct mm_struct *oldmm) > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > unsigned long charge = 0; > LIST_HEAD(uf); > VMA_ITERATOR(vmi, mm, 0); > + bool only_user; > > if (mmap_write_lock_killable(oldmm)) > return -EINTR; > + only_user = atomic_read(&oldmm->mm_users) == 1; > flush_cache_dup_mm(oldmm); > uprobe_dup_mmap(oldmm, mm); > /* > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > mt_clear_in_rcu(vmi.mas.tree); > for_each_vma(vmi, mpnt) { > struct file *file; > + bool locked; > + > + locked = vma_start_write(mpnt); > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > - vma_start_write(mpnt); > if (mpnt->vm_flags & VM_DONTCOPY) { > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > mpnt->vm_end, GFP_KERNEL); > > -- > Mateusz Guzik <mjguzik gmail.com>
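For reference, the exclusion argument above can be condensed into the following pseudocode of the two competing paths (the access_remote_vm() internals are paraphrased and abbreviated, not quoted from any tree):

	/* forking task, kernel/fork.c */
	dup_mmap(mm, oldmm)
		mmap_write_lock_killable(oldmm)   /* exclusive over oldmm */
		for_each_vma(vmi, mpnt)
			/* copy mpnt; the per-VMA vma_start_write() is what the proposal skips */
		mmap_write_unlock(oldmm)

	/* remote accessor, e.g. /proc/pid/cmdline or ptrace */
	access_remote_vm(oldmm, ...)
		mmap_read_lock_killable(oldmm)    /* sleeps until the fork drops the write lock */
		/* walk VMAs, fault pages as needed */
		mmap_read_unlock(oldmm)

Because the remote accessor sleeps on the read side for the whole duration of the copy, it can never observe a VMA mid-duplication even when vma_start_write() is skipped; that is the invariant the assertion discussed below would check at runtime.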
On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Here is the original: > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > the thread is weirdly long and I recommend not opening it without a > > good reason, I link it for reference if needed. > > I had to re-read it to remember what it was all about :) To bring > others up-to-speed, the suggested change would look something like > this: > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > unsigned long charge = 0; > LIST_HEAD(uf); > VMA_ITERATOR(vmi, mm, 0); > + bool only_user; > > if (mmap_write_lock_killable(oldmm)) > return -EINTR; > + only_user = atomic_read(&oldmm->mm_users) == 1; > flush_cache_dup_mm(oldmm); > uprobe_dup_mmap(oldmm, mm); > /* > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > mt_clear_in_rcu(vmi.mas.tree); > for_each_vma(vmi, mpnt) { > struct file *file; > > - vma_start_write(mpnt); > + if (!only_user) > + vma_start_write(mpnt); > if (mpnt->vm_flags & VM_DONTCOPY) { > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > mpnt->vm_end, GFP_KERNEL); > > > > > At the time there were woes concerning stability of the new locking > > scheme, resulting in CC list being rather excessive. As such I'm > > starting a new thread with a modest list, not sure who to add though. > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > vma_start_write() to protect against fault handling in another threads > > using the same mm. > > > > If this is the only thread with this ->mm in place, there are no > > sibling threads to worry about and this can be checked with mm_users > > == 1. > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > provides exclusion against dup_mmap, meaning they don't pose a problem > > either. > > Yes, I believe the optimization you proposed would be safe. > > > > > The patch to merely skip locking is a few liner and I would officially > > submit it myself, but for my taste an assert is needed in fault > > handling to runtime test the above invariant. > > Hmm. Adding an assert in the pagefault path that checks whether the > fault is triggered during dup_mmap() && mm_users==1 would not be > trivial. We would need to indicate that we expect no page faults while > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > assert that this flag is not set inside the pagefault handling path. > I'm not sure it's worth the complexity... As was discussed in that > thread, the only other task that might fault a page would be external > and therefore would have to go through > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > already holds mmap_write_lock the new user would have to wait. > > > So happens I really > > can't be bothered to figure out how to sort it out and was hoping you > > would step in. ;) Alternatively if you guys don't think the assert is > > warranted, that's your business. > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > already CC'ed Liam) to get their opinion. > > > > > As for whether this can save any locking -- yes: > > Yeah, I'm sure it will make a difference in performance. While forking > we are currently locking each VMA separately, so skipping that would > be nice. > Folks, WDYT? 
Do we need a separate assertion that pagefault can't > happen if mm_users==1 and we are holding mmap_write_lock > (access_remote_vm() will block)? pseudocode-wise I was thinking in the lines of the following in the fault path: if (current->mm != vma->vm_mm) mmap_assert_locked(vma->vm_mm); with the assumption that mmap_assert_locked expands to nothing without debug > > > > > I added a probe (below for reference) with two args: whether we are > > single-threaded and vma_start_write() returning whether it took the > > down/up cycle and ran make -j 20 in the kernel dir. > > > > The lock was taken for every single vma (377913 in total), while all > > forking processes were single-threaded. Or to put it differently all > > of these were skippable. > > > > the probe (total hack): > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > probe diff: > > diff --git a/fs/namei.c b/fs/namei.c > > index ecb7b95c2ca3..d6cde76eda81 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > page_symlink_inode_operations = { > > .get_link = page_get_link, > > }; > > EXPORT_SYMBOL(page_symlink_inode_operations); > > + > > +void dup_probe(int, int); > > +void dup_probe(int, int) { } > > +EXPORT_SYMBOL(dup_probe); > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > vm_area_struct *vma, unsigned int *mm_l > > * Exclude concurrent readers under the per-VMA lock until the currently > > * write-locked mmap_lock is dropped or downgraded. > > */ > > -static inline void vma_start_write(struct vm_area_struct *vma) > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > { > > unsigned int mm_lock_seq; > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > - return; > > + return false; > > > > down_write(&vma->vm_lock->lock); > > /* > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > vm_area_struct *vma) > > */ > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > up_write(&vma->vm_lock->lock); > > + return true; > > } > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 735405a9c5f3..0cc56255a339 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > struct mm_struct *oldmm) > > pr_warn_once("exe_file_deny_write_access() failed in > > %s\n", __func__); > > } > > > > +void dup_probe(int, int); > > + > > #ifdef CONFIG_MMU > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > struct mm_struct *oldmm) > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > unsigned long charge = 0; > > LIST_HEAD(uf); > > VMA_ITERATOR(vmi, mm, 0); > > + bool only_user; > > > > if (mmap_write_lock_killable(oldmm)) > > return -EINTR; > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > flush_cache_dup_mm(oldmm); > > uprobe_dup_mmap(oldmm, mm); > > /* > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > mt_clear_in_rcu(vmi.mas.tree); > > for_each_vma(vmi, mpnt) { > > struct file *file; > > + bool locked; > > + > > + locked = vma_start_write(mpnt); > > + dup_probe(only_user ? 1 :0, locked ? 
1 : 0); > > > > - vma_start_write(mpnt); > > if (mpnt->vm_flags & VM_DONTCOPY) { > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > mpnt->vm_end, GFP_KERNEL); > > > > -- > > Mateusz Guzik <mjguzik gmail.com>
On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > Here is the original: > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > the thread is weirdly long and I recommend not opening it without a > > > good reason, I link it for reference if needed. > > > > I had to re-read it to remember what it was all about :) To bring > > others up-to-speed, the suggested change would look something like > > this: > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > unsigned long charge = 0; > > LIST_HEAD(uf); > > VMA_ITERATOR(vmi, mm, 0); > > + bool only_user; > > > > if (mmap_write_lock_killable(oldmm)) > > return -EINTR; > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > flush_cache_dup_mm(oldmm); > > uprobe_dup_mmap(oldmm, mm); > > /* > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > mt_clear_in_rcu(vmi.mas.tree); > > for_each_vma(vmi, mpnt) { > > struct file *file; > > > > - vma_start_write(mpnt); > > + if (!only_user) > > + vma_start_write(mpnt); > > if (mpnt->vm_flags & VM_DONTCOPY) { > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > At the time there were woes concerning stability of the new locking > > > scheme, resulting in CC list being rather excessive. As such I'm > > > starting a new thread with a modest list, not sure who to add though. > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > vma_start_write() to protect against fault handling in another threads > > > using the same mm. > > > > > > If this is the only thread with this ->mm in place, there are no > > > sibling threads to worry about and this can be checked with mm_users > > > == 1. > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > either. > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > submit it myself, but for my taste an assert is needed in fault > > > handling to runtime test the above invariant. > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > fault is triggered during dup_mmap() && mm_users==1 would not be > > trivial. We would need to indicate that we expect no page faults while > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > assert that this flag is not set inside the pagefault handling path. > > I'm not sure it's worth the complexity... As was discussed in that > > thread, the only other task that might fault a page would be external > > and therefore would have to go through > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > already holds mmap_write_lock the new user would have to wait. > > > > > So happens I really > > > can't be bothered to figure out how to sort it out and was hoping you > > > would step in. ;) Alternatively if you guys don't think the assert is > > > warranted, that's your business. > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > already CC'ed Liam) to get their opinion. 
> > > > > > > > As for whether this can save any locking -- yes: > > > > Yeah, I'm sure it will make a difference in performance. While forking > > we are currently locking each VMA separately, so skipping that would > > be nice. > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > happen if mm_users==1 and we are holding mmap_write_lock > > (access_remote_vm() will block)? > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > if (current->mm != vma->vm_mm) > mmap_assert_locked(vma->vm_mm); > > with the assumption that mmap_assert_locked expands to nothing without debug I see. IOW, if someone external is faulting a page then it has to be holding at least mmap_read_lock. So, it's a more general check but seems reasonable. I think adding it at the end of lock_vma_under_rcu() under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > single-threaded and vma_start_write() returning whether it took the > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > The lock was taken for every single vma (377913 in total), while all > > > forking processes were single-threaded. Or to put it differently all > > > of these were skippable. > > > > > > the probe (total hack): > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > probe diff: > > > diff --git a/fs/namei.c b/fs/namei.c > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > --- a/fs/namei.c > > > +++ b/fs/namei.c > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > page_symlink_inode_operations = { > > > .get_link = page_get_link, > > > }; > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > + > > > +void dup_probe(int, int); > > > +void dup_probe(int, int) { } > > > +EXPORT_SYMBOL(dup_probe); > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > vm_area_struct *vma, unsigned int *mm_l > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > * write-locked mmap_lock is dropped or downgraded. 
> > > */ > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > { > > > unsigned int mm_lock_seq; > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > - return; > > > + return false; > > > > > > down_write(&vma->vm_lock->lock); > > > /* > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > vm_area_struct *vma) > > > */ > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > up_write(&vma->vm_lock->lock); > > > + return true; > > > } > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > index 735405a9c5f3..0cc56255a339 100644 > > > --- a/kernel/fork.c > > > +++ b/kernel/fork.c > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > struct mm_struct *oldmm) > > > pr_warn_once("exe_file_deny_write_access() failed in > > > %s\n", __func__); > > > } > > > > > > +void dup_probe(int, int); > > > + > > > #ifdef CONFIG_MMU > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > struct mm_struct *oldmm) > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > unsigned long charge = 0; > > > LIST_HEAD(uf); > > > VMA_ITERATOR(vmi, mm, 0); > > > + bool only_user; > > > > > > if (mmap_write_lock_killable(oldmm)) > > > return -EINTR; > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > flush_cache_dup_mm(oldmm); > > > uprobe_dup_mmap(oldmm, mm); > > > /* > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > mt_clear_in_rcu(vmi.mas.tree); > > > for_each_vma(vmi, mpnt) { > > > struct file *file; > > > + bool locked; > > > + > > > + locked = vma_start_write(mpnt); > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > - vma_start_write(mpnt); > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > mpnt->vm_end, GFP_KERNEL); > > > > > > -- > > > Mateusz Guzik <mjguzik gmail.com> > > > > -- > Mateusz Guzik <mjguzik gmail.com>
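For reference, a minimal sketch of the check being discussed, assuming it is added near the end of lock_vma_under_rcu() just before the locked VMA is returned (placement and surrounding code are assumptions based on this thread, not a posted patch):

#ifdef CONFIG_DEBUG_VM
	/*
	 * Anyone faulting on an mm that is not their own is expected to
	 * hold that mm's mmap_lock (e.g. via access_remote_vm()), so an
	 * unlocked remote fault racing with a lock-skipping dup_mmap()
	 * would trip this assertion.
	 */
	if (current->mm != vma->vm_mm)
		mmap_assert_locked(vma->vm_mm);
#endif

With CONFIG_DEBUG_VM disabled the block compiles away, which matches the assumption that the check costs nothing in production builds.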
top post warning It just hit me that userfaultfd() may be a problem. Creating it only bumps mm_count, while mm_users gets transiently modified during various ops. So in particular you may end up pounding off the userfaultfd instance to another process while being single-threaded, then mm_users == 1 and userfaultfd may be trying to do something. I have no idea if this one is guaranteed to take the lock. However, the good news is that mm_count tends to be 1. If both mm_count and mm_users are 1, then there is no usefaultfd in use and nobody to add it either. State of mm_count verified with: bpftrace -e 'kprobe:copy_process { @[curtask->mm->mm_count.counter] = count(); }' On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > Here is the original: > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > > > the thread is weirdly long and I recommend not opening it without a > > > > good reason, I link it for reference if needed. > > > > > > I had to re-read it to remember what it was all about :) To bring > > > others up-to-speed, the suggested change would look something like > > > this: > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > unsigned long charge = 0; > > > LIST_HEAD(uf); > > > VMA_ITERATOR(vmi, mm, 0); > > > + bool only_user; > > > > > > if (mmap_write_lock_killable(oldmm)) > > > return -EINTR; > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > flush_cache_dup_mm(oldmm); > > > uprobe_dup_mmap(oldmm, mm); > > > /* > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > mt_clear_in_rcu(vmi.mas.tree); > > > for_each_vma(vmi, mpnt) { > > > struct file *file; > > > > > > - vma_start_write(mpnt); > > > + if (!only_user) > > > + vma_start_write(mpnt); > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > At the time there were woes concerning stability of the new locking > > > > scheme, resulting in CC list being rather excessive. As such I'm > > > > starting a new thread with a modest list, not sure who to add though. > > > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > > vma_start_write() to protect against fault handling in another threads > > > > using the same mm. > > > > > > > > If this is the only thread with this ->mm in place, there are no > > > > sibling threads to worry about and this can be checked with mm_users > > > > == 1. > > > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > > either. > > > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > > submit it myself, but for my taste an assert is needed in fault > > > > handling to runtime test the above invariant. > > > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > > fault is triggered during dup_mmap() && mm_users==1 would not be > > > trivial. 
We would need to indicate that we expect no page faults while > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > > assert that this flag is not set inside the pagefault handling path. > > > I'm not sure it's worth the complexity... As was discussed in that > > > thread, the only other task that might fault a page would be external > > > and therefore would have to go through > > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > > already holds mmap_write_lock the new user would have to wait. > > > > > > > So happens I really > > > > can't be bothered to figure out how to sort it out and was hoping you > > > > would step in. ;) Alternatively if you guys don't think the assert is > > > > warranted, that's your business. > > > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > > already CC'ed Liam) to get their opinion. > > > > > > > > > > > As for whether this can save any locking -- yes: > > > > > > Yeah, I'm sure it will make a difference in performance. While forking > > > we are currently locking each VMA separately, so skipping that would > > > be nice. > > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > > happen if mm_users==1 and we are holding mmap_write_lock > > > (access_remote_vm() will block)? > > > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > > > if (current->mm != vma->vm_mm) > > mmap_assert_locked(vma->vm_mm); > > > > with the assumption that mmap_assert_locked expands to nothing without debug > > I see. IOW, if someone external is faulting a page then it has to be > holding at least mmap_read_lock. So, it's a more general check but > seems reasonable. I think adding it at the end of lock_vma_under_rcu() > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > > single-threaded and vma_start_write() returning whether it took the > > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > > > The lock was taken for every single vma (377913 in total), while all > > > > forking processes were single-threaded. Or to put it differently all > > > > of these were skippable. > > > > > > > > the probe (total hack): > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > > > probe diff: > > > > diff --git a/fs/namei.c b/fs/namei.c > > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > > --- a/fs/namei.c > > > > +++ b/fs/namei.c > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > > page_symlink_inode_operations = { > > > > .get_link = page_get_link, > > > > }; > > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > > + > > > > +void dup_probe(int, int); > > > > +void dup_probe(int, int) { } > > > > +EXPORT_SYMBOL(dup_probe); > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > > --- a/include/linux/mm.h > > > > +++ b/include/linux/mm.h > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > > vm_area_struct *vma, unsigned int *mm_l > > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > > * write-locked mmap_lock is dropped or downgraded. 
> > > > */ > > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > > { > > > > unsigned int mm_lock_seq; > > > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > > - return; > > > > + return false; > > > > > > > > down_write(&vma->vm_lock->lock); > > > > /* > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > > vm_area_struct *vma) > > > > */ > > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > > up_write(&vma->vm_lock->lock); > > > > + return true; > > > > } > > > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > index 735405a9c5f3..0cc56255a339 100644 > > > > --- a/kernel/fork.c > > > > +++ b/kernel/fork.c > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > > struct mm_struct *oldmm) > > > > pr_warn_once("exe_file_deny_write_access() failed in > > > > %s\n", __func__); > > > > } > > > > > > > > +void dup_probe(int, int); > > > > + > > > > #ifdef CONFIG_MMU > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > struct mm_struct *oldmm) > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > unsigned long charge = 0; > > > > LIST_HEAD(uf); > > > > VMA_ITERATOR(vmi, mm, 0); > > > > + bool only_user; > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > return -EINTR; > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > flush_cache_dup_mm(oldmm); > > > > uprobe_dup_mmap(oldmm, mm); > > > > /* > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > for_each_vma(vmi, mpnt) { > > > > struct file *file; > > > > + bool locked; > > > > + > > > > + locked = vma_start_write(mpnt); > > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > > > - vma_start_write(mpnt); > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > -- > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > -- > > Mateusz Guzik <mjguzik gmail.com>
On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > top post warning > > It just hit me that userfaultfd() may be a problem. Creating it only > bumps mm_count, while mm_users gets transiently modified during > various ops. > > So in particular you may end up pounding off the userfaultfd instance > to another process while being single-threaded, then mm_users == 1 and > userfaultfd may be trying to do something. I have no idea if this one > is guaranteed to take the lock. Hmm, yeah. uffd is nasty. Even in the cases when it does mmget_not_zero(), there is still the possibility of a race. For example: dup_mmap only_user = atomic_read(&oldmm->mm_users) == 1; userfaultfd_move() mmget_not_zero() <-- inc mm_users if (!only_user) vma_start_write(mpnt); <-- gets skipped move_pages() uffd_move_lock() uffd_lock_vma() lock_vma_under_rcu() <-- succeeds move_pages_pte() copy_page_range() So, userfaultfd_move() will happily move pages while dup_mmap() is doing copy_page_range(). The copied range might look quite interesting... > > However, the good news is that mm_count tends to be 1. If both > mm_count and mm_users are 1, then there is no usefaultfd in use and > nobody to add it either. I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while calling mmgrab(), therefore I think it can race with the code checking its value. > > State of mm_count verified with: bpftrace -e 'kprobe:copy_process { > @[curtask->mm->mm_count.counter] = count(); }' > > On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > Here is the original: > > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > > > > > the thread is weirdly long and I recommend not opening it without a > > > > > good reason, I link it for reference if needed. > > > > > > > > I had to re-read it to remember what it was all about :) To bring > > > > others up-to-speed, the suggested change would look something like > > > > this: > > > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > unsigned long charge = 0; > > > > LIST_HEAD(uf); > > > > VMA_ITERATOR(vmi, mm, 0); > > > > + bool only_user; > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > return -EINTR; > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > flush_cache_dup_mm(oldmm); > > > > uprobe_dup_mmap(oldmm, mm); > > > > /* > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > for_each_vma(vmi, mpnt) { > > > > struct file *file; > > > > > > > > - vma_start_write(mpnt); > > > > + if (!only_user) > > > > + vma_start_write(mpnt); > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > > > > At the time there were woes concerning stability of the new locking > > > > > scheme, resulting in CC list being rather excessive. As such I'm > > > > > starting a new thread with a modest list, not sure who to add though. > > > > > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > > > vma_start_write() to protect against fault handling in another threads > > > > > using the same mm. 
> > > > > > > > > > If this is the only thread with this ->mm in place, there are no > > > > > sibling threads to worry about and this can be checked with mm_users > > > > > == 1. > > > > > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > > > either. > > > > > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > > > submit it myself, but for my taste an assert is needed in fault > > > > > handling to runtime test the above invariant. > > > > > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > > > fault is triggered during dup_mmap() && mm_users==1 would not be > > > > trivial. We would need to indicate that we expect no page faults while > > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > > > assert that this flag is not set inside the pagefault handling path. > > > > I'm not sure it's worth the complexity... As was discussed in that > > > > thread, the only other task that might fault a page would be external > > > > and therefore would have to go through > > > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > > > already holds mmap_write_lock the new user would have to wait. > > > > > > > > > So happens I really > > > > > can't be bothered to figure out how to sort it out and was hoping you > > > > > would step in. ;) Alternatively if you guys don't think the assert is > > > > > warranted, that's your business. > > > > > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > > > already CC'ed Liam) to get their opinion. > > > > > > > > > > > > > > As for whether this can save any locking -- yes: > > > > > > > > Yeah, I'm sure it will make a difference in performance. While forking > > > > we are currently locking each VMA separately, so skipping that would > > > > be nice. > > > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > > > happen if mm_users==1 and we are holding mmap_write_lock > > > > (access_remote_vm() will block)? > > > > > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > > > > > if (current->mm != vma->vm_mm) > > > mmap_assert_locked(vma->vm_mm); > > > > > > with the assumption that mmap_assert_locked expands to nothing without debug > > > > I see. IOW, if someone external is faulting a page then it has to be > > holding at least mmap_read_lock. So, it's a more general check but > > seems reasonable. I think adding it at the end of lock_vma_under_rcu() > > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > > > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > > > single-threaded and vma_start_write() returning whether it took the > > > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > > > > > The lock was taken for every single vma (377913 in total), while all > > > > > forking processes were single-threaded. Or to put it differently all > > > > > of these were skippable. 
> > > > > > > > > > the probe (total hack): > > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > > > > > probe diff: > > > > > diff --git a/fs/namei.c b/fs/namei.c > > > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > > > --- a/fs/namei.c > > > > > +++ b/fs/namei.c > > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > > > page_symlink_inode_operations = { > > > > > .get_link = page_get_link, > > > > > }; > > > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > > > + > > > > > +void dup_probe(int, int); > > > > > +void dup_probe(int, int) { } > > > > > +EXPORT_SYMBOL(dup_probe); > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > > > --- a/include/linux/mm.h > > > > > +++ b/include/linux/mm.h > > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > > > vm_area_struct *vma, unsigned int *mm_l > > > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > > > * write-locked mmap_lock is dropped or downgraded. > > > > > */ > > > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > > > { > > > > > unsigned int mm_lock_seq; > > > > > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > > > - return; > > > > > + return false; > > > > > > > > > > down_write(&vma->vm_lock->lock); > > > > > /* > > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > > > vm_area_struct *vma) > > > > > */ > > > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > > > up_write(&vma->vm_lock->lock); > > > > > + return true; > > > > > } > > > > > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > > index 735405a9c5f3..0cc56255a339 100644 > > > > > --- a/kernel/fork.c > > > > > +++ b/kernel/fork.c > > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > > > struct mm_struct *oldmm) > > > > > pr_warn_once("exe_file_deny_write_access() failed in > > > > > %s\n", __func__); > > > > > } > > > > > > > > > > +void dup_probe(int, int); > > > > > + > > > > > #ifdef CONFIG_MMU > > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > struct mm_struct *oldmm) > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > unsigned long charge = 0; > > > > > LIST_HEAD(uf); > > > > > VMA_ITERATOR(vmi, mm, 0); > > > > > + bool only_user; > > > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > > return -EINTR; > > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > > flush_cache_dup_mm(oldmm); > > > > > uprobe_dup_mmap(oldmm, mm); > > > > > /* > > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > > for_each_vma(vmi, mpnt) { > > > > > struct file *file; > > > > > + bool locked; > > > > > + > > > > > + locked = vma_start_write(mpnt); > > > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > > > > > - vma_start_write(mpnt); > > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > -- > > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > > > > > -- > > > Mateusz Guzik <mjguzik gmail.com> > > > > -- > Mateusz Guzik <mjguzik gmail.com>
On Sun, Mar 30, 2025 at 12:23 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > top post warning > > > > It just hit me that userfaultfd() may be a problem. Creating it only > > bumps mm_count, while mm_users gets transiently modified during > > various ops. > > > > So in particular you may end up pounding off the userfaultfd instance > > to another process while being single-threaded, then mm_users == 1 and > > userfaultfd may be trying to do something. I have no idea if this one > > is guaranteed to take the lock. > > Hmm, yeah. uffd is nasty. Even in the cases when it does > mmget_not_zero(), there is still the possibility of a race. For > example: > > dup_mmap > only_user = atomic_read(&oldmm->mm_users) == 1; > > userfaultfd_move() > mmget_not_zero() <-- > inc mm_users > > if (!only_user) > vma_start_write(mpnt); <-- gets skipped > > move_pages() > uffd_move_lock() > uffd_lock_vma() > > lock_vma_under_rcu() <-- succeeds > move_pages_pte() > copy_page_range() Sorry about formatting. This might look a bit better: Task 1 Task 2 dup_mmap only_user = atomic_read(&oldmm->mm_users) == 1; userfaultfd_move() mmget_not_zero() if (!only_user) vma_start_write(mpnt); <-- gets skipped move_pages() uffd_move_lock() uffd_lock_vma() lock_vma_under_rcu() move_pages_pte() copy_page_range() > > So, userfaultfd_move() will happily move pages while dup_mmap() is > doing copy_page_range(). The copied range might look quite > interesting... > > > > > However, the good news is that mm_count tends to be 1. If both > > mm_count and mm_users are 1, then there is no usefaultfd in use and > > nobody to add it either. > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while > calling mmgrab(), therefore I think it can race with the code checking > its value. > > > > > State of mm_count verified with: bpftrace -e 'kprobe:copy_process { > > @[curtask->mm->mm_count.counter] = count(); }' > > > > On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > Here is the original: > > > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > > > > > > > the thread is weirdly long and I recommend not opening it without a > > > > > > good reason, I link it for reference if needed. 
> > > > > > > > > > I had to re-read it to remember what it was all about :) To bring > > > > > others up-to-speed, the suggested change would look something like > > > > > this: > > > > > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > unsigned long charge = 0; > > > > > LIST_HEAD(uf); > > > > > VMA_ITERATOR(vmi, mm, 0); > > > > > + bool only_user; > > > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > > return -EINTR; > > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > > flush_cache_dup_mm(oldmm); > > > > > uprobe_dup_mmap(oldmm, mm); > > > > > /* > > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > > for_each_vma(vmi, mpnt) { > > > > > struct file *file; > > > > > > > > > > - vma_start_write(mpnt); > > > > > + if (!only_user) > > > > > + vma_start_write(mpnt); > > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > > > > > > > At the time there were woes concerning stability of the new locking > > > > > > scheme, resulting in CC list being rather excessive. As such I'm > > > > > > starting a new thread with a modest list, not sure who to add though. > > > > > > > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > > > > vma_start_write() to protect against fault handling in another threads > > > > > > using the same mm. > > > > > > > > > > > > If this is the only thread with this ->mm in place, there are no > > > > > > sibling threads to worry about and this can be checked with mm_users > > > > > > == 1. > > > > > > > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > > > > either. > > > > > > > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > > > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > > > > submit it myself, but for my taste an assert is needed in fault > > > > > > handling to runtime test the above invariant. > > > > > > > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > > > > fault is triggered during dup_mmap() && mm_users==1 would not be > > > > > trivial. We would need to indicate that we expect no page faults while > > > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > > > > assert that this flag is not set inside the pagefault handling path. > > > > > I'm not sure it's worth the complexity... As was discussed in that > > > > > thread, the only other task that might fault a page would be external > > > > > and therefore would have to go through > > > > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > > > > already holds mmap_write_lock the new user would have to wait. > > > > > > > > > > > So happens I really > > > > > > can't be bothered to figure out how to sort it out and was hoping you > > > > > > would step in. ;) Alternatively if you guys don't think the assert is > > > > > > warranted, that's your business. > > > > > > > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > > > > already CC'ed Liam) to get their opinion. > > > > > > > > > > > > > > > > > As for whether this can save any locking -- yes: > > > > > > > > > > Yeah, I'm sure it will make a difference in performance. 
While forking > > > > > we are currently locking each VMA separately, so skipping that would > > > > > be nice. > > > > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > > > > happen if mm_users==1 and we are holding mmap_write_lock > > > > > (access_remote_vm() will block)? > > > > > > > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > > > > > > > if (current->mm != vma->vm_mm) > > > > mmap_assert_locked(vma->vm_mm); > > > > > > > > with the assumption that mmap_assert_locked expands to nothing without debug > > > > > > I see. IOW, if someone external is faulting a page then it has to be > > > holding at least mmap_read_lock. So, it's a more general check but > > > seems reasonable. I think adding it at the end of lock_vma_under_rcu() > > > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > > > > > > > > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > > > > single-threaded and vma_start_write() returning whether it took the > > > > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > > > > > > > The lock was taken for every single vma (377913 in total), while all > > > > > > forking processes were single-threaded. Or to put it differently all > > > > > > of these were skippable. > > > > > > > > > > > > the probe (total hack): > > > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > > > > > > > probe diff: > > > > > > diff --git a/fs/namei.c b/fs/namei.c > > > > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > > > > --- a/fs/namei.c > > > > > > +++ b/fs/namei.c > > > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > > > > page_symlink_inode_operations = { > > > > > > .get_link = page_get_link, > > > > > > }; > > > > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > > > > + > > > > > > +void dup_probe(int, int); > > > > > > +void dup_probe(int, int) { } > > > > > > +EXPORT_SYMBOL(dup_probe); > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > > > > --- a/include/linux/mm.h > > > > > > +++ b/include/linux/mm.h > > > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > > > > vm_area_struct *vma, unsigned int *mm_l > > > > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > > > > * write-locked mmap_lock is dropped or downgraded. 
> > > > > > */ > > > > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > > > > { > > > > > > unsigned int mm_lock_seq; > > > > > > > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > > > > - return; > > > > > > + return false; > > > > > > > > > > > > down_write(&vma->vm_lock->lock); > > > > > > /* > > > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > > > > vm_area_struct *vma) > > > > > > */ > > > > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > > > > up_write(&vma->vm_lock->lock); > > > > > > + return true; > > > > > > } > > > > > > > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > > > index 735405a9c5f3..0cc56255a339 100644 > > > > > > --- a/kernel/fork.c > > > > > > +++ b/kernel/fork.c > > > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > > > > struct mm_struct *oldmm) > > > > > > pr_warn_once("exe_file_deny_write_access() failed in > > > > > > %s\n", __func__); > > > > > > } > > > > > > > > > > > > +void dup_probe(int, int); > > > > > > + > > > > > > #ifdef CONFIG_MMU > > > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > > struct mm_struct *oldmm) > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > > unsigned long charge = 0; > > > > > > LIST_HEAD(uf); > > > > > > VMA_ITERATOR(vmi, mm, 0); > > > > > > + bool only_user; > > > > > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > > > return -EINTR; > > > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > > > flush_cache_dup_mm(oldmm); > > > > > > uprobe_dup_mmap(oldmm, mm); > > > > > > /* > > > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > > > for_each_vma(vmi, mpnt) { > > > > > > struct file *file; > > > > > > + bool locked; > > > > > > + > > > > > > + locked = vma_start_write(mpnt); > > > > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > > > > > > > - vma_start_write(mpnt); > > > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > > -- > > > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > > > > > > > > > -- > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > -- > > Mateusz Guzik <mjguzik gmail.com>
On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote: > > However, the good news is that mm_count tends to be 1. If both > > mm_count and mm_users are 1, then there is no usefaultfd in use and > > nobody to add it either. > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while > calling mmgrab(), therefore I think it can race with the code checking > its value. > It issues: ctx->mm = current->mm; ... mmgrab(ctx->mm); Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to create it either and the optimization is saved.
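Concretely, the stricter gate Mateusz describes would look roughly like this in dup_mmap(), sampling both counters only after the exclusive mmap_lock is held (a sketch of the idea, not a posted patch, and not a claim that it closes every race):

	if (mmap_write_lock_killable(oldmm))
		return -EINTR;
	/*
	 * Skip per-VMA locking only when nothing else can reach this mm:
	 * mm_users == 1 rules out sibling threads, and mm_count == 1
	 * additionally rules out an existing userfaultfd context (or any
	 * other holder of a bare mm reference) and anyone who could
	 * create one.
	 */
	only_user = atomic_read(&oldmm->mm_count) == 1 &&
		    atomic_read(&oldmm->mm_users) == 1;
	...
	for_each_vma(vmi, mpnt) {
		struct file *file;

		if (!only_user)
			vma_start_write(mpnt);
		...
	}

Whether sampling mm_count under the write lock is actually sufficient is exactly what the rest of the thread debates.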
* Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]: > On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > However, the good news is that mm_count tends to be 1. If both > > > mm_count and mm_users are 1, then there is no usefaultfd in use and > > > nobody to add it either. > > > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while > > calling mmgrab(), therefore I think it can race with the code checking > > its value. > > Considering the mm struct isn't the only way to find the vmas and there are users who use other locks to ensure the mm and vma don't go away (rmap, for example). It is reasonable to think that other users may use the vma lock to avoid mm_struct accesses racing. Although I don't know of a way this is unsafe today, we are complicating the locking story of the mm with this change and no data has been given on the benefit. I don't recall any regression caused by the addition of per-vma locking? > > It issues: > ctx->mm = current->mm; > ... > mmgrab(ctx->mm); > > Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in > dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to > create it either and the optimization is saved. mm_count is lazy, so I am not entirely sure we can trust what it says. But maybe that's only true of mmgrab_lazy_tlb() now? Thanks, Liam
On Mon, Mar 31, 2025 at 6:43 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]: > > On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > However, the good news is that mm_count tends to be 1. If both > > > > mm_count and mm_users are 1, then there is no usefaultfd in use and > > > > nobody to add it either. > > > > > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while > > > calling mmgrab(), therefore I think it can race with the code checking > > > its value. > > > > > Considering the mm struct isn't the only way to find the vmas and there > are users who use other locks to ensure the mm and vma don't go away > (rmap, for example). It is reasonable to think that other users may use > the vma lock to avoid mm_struct accesses racing. > > Although I don't know of a way this is unsafe today, we are complicating > the locking story of the mm with this change and no data has been given > on the benefit. I don't recall any regression caused by the addition of > per-vma locking? > I was not involved in any of this, but I learned about the issue from lwn: https://lwn.net/Articles/937943/ the new (at the time) per-vma locking was suffering weird crashes in multithreaded programs and Suren ultimately fixed it by locking parent vma at a 5% hit, see fb49c455323f ("fork: lock VMAs of the parent process when forking"). The patch merely adds vma_start_write(mpnt) in dup_mmap. What I'm proposing here remedies the problem for most commonly forking consumers (single-threaded), assuming it does work. ;) To that end see below. > > > > It issues: > > ctx->mm = current->mm; > > ... > > mmgrab(ctx->mm); > > > > Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in > > dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to > > create it either and the optimization is saved. > > mm_count is lazy, so I am not entirely sure we can trust what it says. > But maybe that's only true of mmgrab_lazy_tlb() now? > warning: I don't know the Linux nomenclature here. I'm going to outline how I see it. There is an idiomatic way of splitting ref counts into two: - something to prevent the struct itself from getting freed ("hold count" where I'm from, in this case ->mm_count) - something to prevent data used by the structure from getting freed ("use count" where I'm from, in this case ->mm_users) mm_users > 0 keeps one ref on ->mm_count AFAICS the scheme employed for mm follows the mold. So with that mmgrab_lazy_tlb() example, the call bumps the count on first use. Suppose we are left with one thread in the process and a lazy tlb somewhere as the only consumers. mm_users is 1 because of the only thread and mm_count is 2 -- one ref for mm_users > 0 and one ref for lazy tlb. Then my proposed check: ->mm_count == 1 && mm->mm_users == 1 ... fails and the optimization is not used. Now, per the above, the lock was added to protect against faults happening in parallel. The only cases I found where this is of note are remote accesses (e.g., from /proc/pid/cmdline) and userfaultfd. I'm not an mm person and this is why I referred to Suren to sort this out, hoping he would both have interest and enough knowledge about mm to validate it. That is to say I don't *vouch* the idea myself (otherwise I would sign off on a patch), I am merely bringing it up again long after the dust has settled. If the idea is a nogo, then so be it, but then it would be nice to document somewhere why is it so.
On Mon, Mar 31, 2025 at 10:51 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Mar 31, 2025 at 6:43 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]:
> > > On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > However, the good news is that mm_count tends to be 1. If both
> > > > > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > > > > nobody to add it either.
> > > >
> > > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> > > > calling mmgrab(), therefore I think it can race with the code checking
> > > > its value.
> > > >
> > Considering the mm struct isn't the only way to find the vmas, and there
> > are users who use other locks to ensure the mm and vma don't go away
> > (rmap, for example), it is reasonable to think that other users may use
> > the vma lock to avoid mm_struct accesses racing.
> >
> > Although I don't know of a way this is unsafe today, we are complicating
> > the locking story of the mm with this change, and no data has been given
> > on the benefit. I don't recall any regression caused by the addition of
> > per-vma locking?
> >
>
> I was not involved in any of this, but I learned about the issue from LWN:
> https://lwn.net/Articles/937943/
>
> The new (at the time) per-vma locking was suffering weird crashes in
> multithreaded programs, and Suren ultimately fixed it by locking the
> parent's VMAs at a 5% hit; see fb49c455323f ("fork: lock VMAs of the
> parent process when forking"). That patch merely adds
> vma_start_write(mpnt) in dup_mmap().
>
> What I'm proposing here remedies the problem for the most common forking
> consumers (single-threaded processes), assuming it does work. ;)
>
> To that end see below.
>
> > >
> > > It issues:
> > > ctx->mm = current->mm;
> > > ...
> > > mmgrab(ctx->mm);
> > >
> > > Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in
> > > dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to
> > > create it either and the optimization is saved.
> >
> > mm_count is lazy, so I am not entirely sure we can trust what it says.
> > But maybe that's only true of mmgrab_lazy_tlb() now?
> >
>
> Warning: I don't know the Linux nomenclature here, so I'm going to
> outline how I see it.
>
> There is an idiomatic way of splitting ref counts into two:
> - something to prevent the struct itself from getting freed ("hold
>   count" where I'm from, in this case ->mm_count)
> - something to prevent data used by the structure from getting freed
>   ("use count" where I'm from, in this case ->mm_users)
>
> mm_users > 0 keeps one ref on ->mm_count
>
> AFAICS the scheme employed for mm follows this mold.
>
> So with the mmgrab_lazy_tlb() example, the call bumps the count on
> first use.
>
> Suppose we are left with one thread in the process and a lazy tlb
> somewhere as the only consumers. mm_users is 1 because of the only
> thread, and mm_count is 2 -- one ref for mm_users > 0 and one ref for
> the lazy tlb.
>
> Then my proposed check, mm->mm_count == 1 && mm->mm_users == 1, fails
> and the optimization is not used.
>
> Now, per the above, the lock was added to protect against faults
> happening in parallel. The only cases I found where this is of note
> are remote accesses (e.g., from /proc/pid/cmdline) and userfaultfd.
>
> I'm not an mm person, which is why I referred this to Suren to sort
> out, hoping he would have both the interest and enough knowledge about
> mm to validate it.
>
> That is to say, I don't *vouch* for the idea myself (otherwise I would
> sign off on a patch); I am merely bringing it up again long after the
> dust has settled. If the idea is a no-go, then so be it, but then it
> would be nice to document somewhere why that is.

I think it would be worth optimizing if it's as straightforward as we
think so far. I would like to spend some more time checking the uffd
code to see if anything else might be lurking there before posting the
patch. If I don't find anything new, I'll post a patch sometime next
week.

Thanks,
Suren.

> --
> Mateusz Guzik <mjguzik gmail.com>
* Mateusz Guzik <mjguzik@gmail.com> [250331 13:51]:
> On Mon, Mar 31, 2025 at 6:43 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]:
> > > On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > However, the good news is that mm_count tends to be 1. If both
> > > > > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > > > > nobody to add it either.
> > > >
> > > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> > > > calling mmgrab(), therefore I think it can race with the code checking
> > > > its value.
> > > >
> > Considering the mm struct isn't the only way to find the vmas, and there
> > are users who use other locks to ensure the mm and vma don't go away
> > (rmap, for example), it is reasonable to think that other users may use
> > the vma lock to avoid mm_struct accesses racing.
> >
> > Although I don't know of a way this is unsafe today, we are complicating
> > the locking story of the mm with this change, and no data has been given
> > on the benefit. I don't recall any regression caused by the addition of
> > per-vma locking?
> >
>
> I was not involved in any of this, but I learned about the issue from LWN:
> https://lwn.net/Articles/937943/
>
> The new (at the time) per-vma locking was suffering weird crashes in
> multithreaded programs, and Suren ultimately fixed it by locking the
> parent's VMAs at a 5% hit; see fb49c455323f ("fork: lock VMAs of the
> parent process when forking"). That patch merely adds
> vma_start_write(mpnt) in dup_mmap().

Ah, thanks for the pointer. I forgot about that part.

You are referencing the 5% hit on 5,000 forks of 10,000 vmas in a tight
loop. The kernel build time was not affected. I'm sticking with this
being an inconsequential increase on any meaningful benchmark.

Reading the commit did trigger my memory of the 5% regression, but I
disregarded it because it seemed so unlikely in real life that the
benefits outweighed the upper bound of a 5% regression.

The 5,000 forks of 10,000 vmas benchmark would also be at least a bit
faster with Suren's more recent rcu-safe vma change [1]. Although that
doesn't rule out changing this to get a higher number, I still don't
think the complexity of the change is worth it for a contrived
benchmark.

> What I'm proposing here remedies the problem for the most common forking
> consumers (single-threaded processes), assuming it does work. ;)
>
> To that end see below.
>
> > >
> > > It issues:
> > > ctx->mm = current->mm;
> > > ...
> > > mmgrab(ctx->mm);
> > >
> > > Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in
> > > dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to
> > > create it either and the optimization is saved.
> >
> > mm_count is lazy, so I am not entirely sure we can trust what it says.
> > But maybe that's only true of mmgrab_lazy_tlb() now?
> >
>
> Warning: I don't know the Linux nomenclature here, so I'm going to
> outline how I see it.
>
> There is an idiomatic way of splitting ref counts into two:
> - something to prevent the struct itself from getting freed ("hold
>   count" where I'm from, in this case ->mm_count)
> - something to prevent data used by the structure from getting freed
>   ("use count" where I'm from, in this case ->mm_users)
>
> mm_users > 0 keeps one ref on ->mm_count
>
> AFAICS the scheme employed for mm follows this mold.
>
> So with the mmgrab_lazy_tlb() example, the call bumps the count on
> first use.
>
> Suppose we are left with one thread in the process and a lazy tlb
> somewhere as the only consumers. mm_users is 1 because of the only
> thread, and mm_count is 2 -- one ref for mm_users > 0 and one ref for
> the lazy tlb.
>
> Then my proposed check, mm->mm_count == 1 && mm->mm_users == 1, fails
> and the optimization is not used.
>
> Now, per the above, the lock was added to protect against faults
> happening in parallel. The only cases I found where this is of note
> are remote accesses (e.g., from /proc/pid/cmdline) and userfaultfd.
>
> I'm not an mm person, which is why I referred this to Suren to sort
> out, hoping he would have both the interest and enough knowledge about
> mm to validate it.
>
> That is to say, I don't *vouch* for the idea myself (otherwise I would
> sign off on a patch); I am merely bringing it up again long after the
> dust has settled. If the idea is a no-go, then so be it, but then it
> would be nice to document somewhere why that is.

Thanks for the breakdown here. I am also not sure, but I felt it worth
the mention. Reading Documentation/mm/active_mm.rst makes me uncertain
about the mm_count == 1 && mm_users == 1 test: mm_count is supposed to
count the 'lazy' users, but it may no longer include them.

Since there are no real workloads that suffer, is this worth it?

[1]. https://lore.kernel.org/all/202503311656.e3596aaf-lkp@intel.com/

Thanks,
Liam
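For context on the lazy-tlb point above: whether a lazy user shows up
in mm_count is now configuration dependent. The sketch below is an
approximation from memory of the helper in include/linux/sched/mm.h,
so treat the exact shape as an assumption rather than a quote of the
tree; it is only meant to show why mm_count == 1 does not by itself
prove the absence of lazy users.

/* Approximate shape of the lazy-tlb grab helper. */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
	/*
	 * With CONFIG_MMU_LAZY_TLB_REFCOUNT=n a lazy-tlb user takes no
	 * mm_count reference at all, so a lazy user can exist even while
	 * mm_count reads as 1.
	 */
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmgrab(mm);
}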
diff --git a/fs/namei.c b/fs/namei.c
index ecb7b95c2ca3..d6cde76eda81 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5459,3 +5459,7 @@ const struct inode_operations page_symlink_inode_operations = {
 	.get_link	= page_get_link,
 };
 EXPORT_SYMBOL(page_symlink_inode_operations);
+
+void dup_probe(int, int);
+void dup_probe(int, int) { }
+EXPORT_SYMBOL(dup_probe);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f80baddacc5..f7b1f0a02f2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
  * Exclude concurrent readers under the per-VMA lock until the currently
  * write-locked mmap_lock is dropped or downgraded.
  */
-static inline void vma_start_write(struct vm_area_struct *vma)
+static inline bool vma_start_write(struct vm_area_struct *vma)
 {
 	unsigned int mm_lock_seq;
 
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
-		return;
+		return false;
 
 	down_write(&vma->vm_lock->lock);
 	/*
@@ -776,6 +776,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
 	 */
 	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
 	up_write(&vma->vm_lock->lock);
+	return true;
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..0cc56255a339 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
 		pr_warn_once("exe_file_deny_write_access() failed in %s\n",
 			     __func__);
 }
 
+void dup_probe(int, int);
+
 #ifdef CONFIG_MMU
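The diff appears to be cut off before the dup_mmap() side. Judging from
the vma_start_write() return-value change and the empty exported
dup_probe(int, int) stub, the probe presumably serves as a convenient
attach point for counting how many VMAs actually had to take the
per-VMA write lock during a fork. A hypothetical way to consume it is
sketched below; the counter names and the call site are guesses for
illustration, not part of the patch.

	/* Hypothetical instrumentation inside dup_mmap()'s copy loop. */
	int locked = 0, total = 0;

	for_each_vma(vmi, mpnt) {
		total++;
		if (vma_start_write(mpnt))	/* now reports whether it locked */
			locked++;
		/* ... existing copy logic ... */
	}
	/* Empty exported hook, easy to hang a tracepoint or kprobe on. */
	dup_probe(locked, total);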