Message ID | 20211209191325.3069345-3-surenb@google.com |
---|---|
State | New |
Series | [v5,1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap |
On Thu, Dec 09, 2021 at 11:13:25AM -0800, Suren Baghdasaryan wrote:
> With exit_mmap holding mmap_write_lock during free_pgtables call,
> process_mrelease does not need to elevate mm->mm_users in order to
> prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
> is walking the VMA tree. The change prevents process_mrelease from
> calling the last mmput, which can lead to waiting for IO completion
> in exit_aio.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> changes in v5
> - Removed Fixes: tag, per Michal Hocko
> - Added Acked-by's
>
> mm/oom_kill.c | 27 +++++++++++++++------------
> 1 file changed, 15 insertions(+), 12 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

There are mmget_not_zero's all over the place; can others be cleaned up
after this series goes ahead too?

It seems like anything doing the mmget just to look at the vma list
under the mmap lock is now fine with only a mmgrab?

A few I know about:

drivers/infiniband/core/umem_odp.c:     if (!mmget_not_zero(umem->owning_mm)) {

This is because mmu_interval_notifier_insert() might call
mm_take_all_locks(), which was unsafe with a concurrent exit_mmap.

drivers/infiniband/core/umem_odp.c:     if (!owning_process || !mmget_not_zero(owning_mm)) {

This is because it calls hmm_range_fault(), which iterates over the vma
list and is safe now.

drivers/iommu/iommu-sva-lib.c:  return mmget_not_zero(mm);
drivers/iommu/iommu-sva-lib.c:  return ioasid_find(&iommu_sva_pasid, pasid, __mmget_not_zero);

It calls find_extend_vma() - but it also doesn't seem to have a mmgrab
when it does that mmget. The RCU is messed up here too, so humm.

Jason
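[Editorial note: the mmget-vs-mmgrab distinction discussed above can be sketched as below. This is an illustrative fragment against kernel-internal APIs, not buildable standalone; the walk_vmas_before()/walk_vmas_after() helper names are hypothetical and not from the patch.]

```c
/* Before this series: a caller had to pin mm_users, because exit_mmap()
 * could otherwise free page tables while the VMA tree was being walked. */
static bool walk_vmas_before(struct mm_struct *mm)
{
	if (!mmget_not_zero(mm))	/* pin mm_users; fails once the mm is exiting */
		return false;
	mmap_read_lock(mm);
	/* ... walk the VMA tree ... */
	mmap_read_unlock(mm);
	mmput(mm);	/* may drop the last reference and wait on IO in exit_aio */
	return true;
}

/* After: free_pgtables() runs under mmap_write_lock, so holding the mmap
 * lock plus a struct-lifetime pin (mm_count) is enough. */
static bool walk_vmas_after(struct mm_struct *mm)
{
	mmgrab(mm);	/* pin struct mm_struct only, not the address space */
	mmap_read_lock(mm);
	if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
		/* ... walk the VMA tree; free_pgtables cannot race here ... */
	}
	mmap_read_unlock(mm);
	mmdrop(mm);	/* never sleeps on IO, unlike a final mmput() */
	return true;
}
```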
On Thu, Dec 9, 2021 at 12:48 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 09, 2021 at 11:13:25AM -0800, Suren Baghdasaryan wrote:
> > With exit_mmap holding mmap_write_lock during free_pgtables call,
> > process_mrelease does not need to elevate mm->mm_users in order to
> > prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
> > is walking the VMA tree. The change prevents process_mrelease from
> > calling the last mmput, which can lead to waiting for IO completion
> > in exit_aio.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > ---
> > changes in v5
> > - Removed Fixes: tag, per Michal Hocko
> > - Added Acked-by's
> >
> > mm/oom_kill.c | 27 +++++++++++++++------------
> > 1 file changed, 15 insertions(+), 12 deletions(-)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

>
> There are mmget_not_zero's all over the place, can others be cleaned
> after this series goes ahead too?
>
> It seems like anything doing the mmget just to look at the vma list
> under the mmap lock is now fine with only a mmgrab?

Sounds reasonable to me. I'll try to carve out some time to look into
it, but no promises. Lots of loose ends to tie up at the end of the
year :(

>
> A few I know about:
>
> drivers/infiniband/core/umem_odp.c:     if (!mmget_not_zero(umem->owning_mm)) {
>
> This is because mmu_interval_notifier_insert() might call
> mm_take_all_locks() which was unsafe with concurrent exit_mmap
>
> drivers/infiniband/core/umem_odp.c:     if (!owning_process || !mmget_not_zero(owning_mm)) {
>
> This is because it calls hmm_range_fault() which iterates over the vma
> list which is safe now
>
> drivers/iommu/iommu-sva-lib.c:  return mmget_not_zero(mm);
> drivers/iommu/iommu-sva-lib.c:  return ioasid_find(&iommu_sva_pasid, pasid, __mmget_not_zero);
>
> It calls find_extend_vma() - but also it doesn't seem to have a mmgrab when it
> does that mmget. The rcu is messed up here too, so humm.
>
> Jason
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..67780386f478 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1169,15 +1169,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		goto put_task;
 	}
 
-	if (mmget_not_zero(p->mm)) {
-		mm = p->mm;
-		if (task_will_free_mem(p))
-			reap = true;
-		else {
-			/* Error only if the work has not been done already */
-			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
-				ret = -EINVAL;
-		}
+	mm = p->mm;
+	mmgrab(mm);
+
+	if (task_will_free_mem(p))
+		reap = true;
+	else {
+		/* Error only if the work has not been done already */
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+			ret = -EINVAL;
 	}
 	task_unlock(p);
 
@@ -1188,13 +1188,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		ret = -EINTR;
 		goto drop_mm;
 	}
-	if (!__oom_reap_task_mm(mm))
+	/*
+	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
+	 * possible change in exit_mmap is seen
+	 */
+	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
 		ret = -EAGAIN;
 	mmap_read_unlock(mm);
 
 drop_mm:
-	if (mm)
-		mmput(mm);
+	mmdrop(mm);
 put_task:
 	put_task_struct(task);
 	return ret;
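[Editorial note: the interleaving that the new MMF_OOM_SKIP recheck guards against can be laid out as a timeline. This is illustrative and not part of the patch; it assumes exit_mmap() sets MMF_OOM_SKIP after reaping, as in this series.]

```
  process_mrelease()                      exit_mmap()
  ------------------                      -----------
  mmgrab(mm)
  task_unlock(p)
                                          mmap_write_lock(mm)
                                          __oom_reap_task_mm(mm)
                                          set_bit(MMF_OOM_SKIP, &mm->flags)
                                          mmap_write_unlock(mm)
  mmap_read_lock(mm)
  test_bit(MMF_OOM_SKIP) -> set,
  so __oom_reap_task_mm() is skipped
  mmap_read_unlock(mm)
  mmdrop(mm)
```

Without the recheck, process_mrelease() would call __oom_reap_task_mm() again over an address space exit_mmap has already reaped.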