diff mbox series

[v5,7/7] mm: Remove the now-unnecessary mmget_still_valid() hack

Message ID 20200827114932.3572699-8-jannh@google.com (mailing list archive)
State New, archived
Headers show
Series Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there | expand

Commit Message

Jann Horn Aug. 27, 2020, 11:49 a.m. UTC
The preceding patches have ensured that core dumping properly takes the
mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
its users.

Signed-off-by: Jann Horn <jannh@google.com>
---
 drivers/infiniband/core/uverbs_main.c |  3 ---
 drivers/vfio/pci/vfio_pci.c           | 38 +++++++++++++--------------
 fs/proc/task_mmu.c                    | 18 -------------
 fs/userfaultfd.c                      | 28 +++++++-------------
 include/linux/sched/mm.h              | 25 ------------------
 mm/khugepaged.c                       |  2 +-
 mm/madvise.c                          | 17 ------------
 mm/mmap.c                             |  5 +---
 8 files changed, 29 insertions(+), 107 deletions(-)

Comments

Hugh Dickins Aug. 31, 2020, 6:06 a.m. UTC | #1
On Thu, 27 Aug 2020, Jann Horn wrote:

> The preceding patches have ensured that core dumping properly takes the
> mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
> its users.

Hi Jann, while the only tears to be shed over losing mmget_still_valid()
will be tears of joy, I think you need to explain why you believe it's
safe to remove the instance in mm/khugepaged.c: which you'll have found
I moved just recently, to cover an extra case (sorry for not Cc'ing you).

> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -431,7 +431,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
>  
>  static inline int khugepaged_test_exit(struct mm_struct *mm)
>  {
> -	return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
> +	return atomic_read(&mm->mm_users) == 0;
>  }
>  
>  static bool hugepage_vma_check(struct vm_area_struct *vma,

The movement (which you have correctly followed) was in
bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
but the "pmd .. physical page 0" issue is explained better in its parent
18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")

I think your core dumping is still reading the page tables without
holding mmap_lock, so still vulnerable to that extra issue.  It won't
be as satisfying as removing all traces of mmget_still_valid(), but
right now I think you should add an mm->core_state check there instead.

(I do have a better solution in use, but it's a much larger patch, that
will take a lot more effort to get in: checks in pte_offset_map_lock(),
perhaps returning NULL when pmd is transitioning, requiring retry.)

Or maybe it's me who has missed what you're doing instead.

Hugh
Jann Horn Aug. 31, 2020, 9:58 a.m. UTC | #2
On Mon, Aug 31, 2020 at 8:07 AM Hugh Dickins <hughd@google.com> wrote:
> On Thu, 27 Aug 2020, Jann Horn wrote:
>
> > The preceding patches have ensured that core dumping properly takes the
> > mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
> > its users.
>
> Hi Jann, while the only tears to be shed over losing mmget_still_valid()
> will be tears of joy, I think you need to explain why you believe it's
> safe to remove the instance in mm/khugepaged.c: which you'll have found
> I moved just recently, to cover an extra case (sorry for not Cc'ing you).
>
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -431,7 +431,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
> >
> >  static inline int khugepaged_test_exit(struct mm_struct *mm)
> >  {
> > -     return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
> > +     return atomic_read(&mm->mm_users) == 0;
> >  }
> >
> >  static bool hugepage_vma_check(struct vm_area_struct *vma,
>
> The movement (which you have correctly followed) was in
> bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
> but the "pmd .. physical page 0" issue is explained better in its parent
> 18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")
>
> I think your core dumping is still reading the page tables without
> holding mmap_lock

Where? get_dump_page() takes mmap_lock now:
<https://lore.kernel.org/lkml/20200827114932.3572699-7-jannh@google.com/>

I don't think there should be any paths into __get_user_pages() left
that don't hold the mmap_lock. Actually, we should probably try
sticking mmap_assert_locked() in there now as a follow-up?

> so still vulnerable to that extra issue.  It won't
> be as satisfying as removing all traces of mmget_still_valid(), but
> right now I think you should add an mm->core_state check there instead.
>
> (I do have a better solution in use, but it's a much larger patch, that
> will take a lot more effort to get in: checks in pte_offset_map_lock(),
> perhaps returning NULL when pmd is transitioning, requiring retry.)

Just to clarify: This is an issue only between GUP's software page
table walks when running without mmap_lock and concurrent page table
modifications from hugepage code, correct? Hardware page table walks
and get_user_pages_fast() are fine because they properly load PTEs
atomically and are written to assume that the page tables can change
arbitrarily under them, and the only guarantee is that disabling
interrupts ensures that pages referenced by PTEs can't be freed,
right?

> Or maybe it's me who has missed what you're doing instead.
>
> Hugh
Hugh Dickins Aug. 31, 2020, 8:36 p.m. UTC | #3
On Mon, 31 Aug 2020, Jann Horn wrote:
> On Mon, Aug 31, 2020 at 8:07 AM Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 27 Aug 2020, Jann Horn wrote:
> >
> > > The preceding patches have ensured that core dumping properly takes the
> > > mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
> > > its users.
> >
> > Hi Jann, while the only tears to be shed over losing mmget_still_valid()
> > will be tears of joy, I think you need to explain why you believe it's
> > safe to remove the instance in mm/khugepaged.c: which you'll have found
> > I moved just recently, to cover an extra case (sorry for not Cc'ing you).
> >
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -431,7 +431,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
> > >
> > >  static inline int khugepaged_test_exit(struct mm_struct *mm)
> > >  {
> > > -     return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
> > > +     return atomic_read(&mm->mm_users) == 0;
> > >  }
> > >
> > >  static bool hugepage_vma_check(struct vm_area_struct *vma,
> >
> > The movement (which you have correctly followed) was in
> > bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
> > but the "pmd .. physical page 0" issue is explained better in its parent
> > 18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")
> >
> > I think your core dumping is still reading the page tables without
> > holding mmap_lock
> 
> Where? get_dump_page() takes mmap_lock now:
> <https://lore.kernel.org/lkml/20200827114932.3572699-7-jannh@google.com/>

Right, sorry for the noise, that's precisely what 6/7 is all about,
and properly declares itself there in its Subject - I plead that I
got distracted by the vma snapshot part of the series, and paid too
little attention before bleating.

Looks good to me - thanks.

> 
> I don't think there should be any paths into __get_user_pages() left
> that don't hold the mmap_lock. Actually, we should probably try
> sticking mmap_assert_locked() in there now as a follow-up?

Maybe: I haven't given it thought, to be honest.

Hugh
Hugh Dickins Aug. 31, 2020, 9:30 p.m. UTC | #4
I didn't answer your questions further down, sorry, resuming...

On Mon, 31 Aug 2020, Jann Horn wrote:
> On Mon, Aug 31, 2020 at 8:07 AM Hugh Dickins <hughd@google.com> wrote:
...
> > but the "pmd .. physical page 0" issue is explained better in its parent
> > 18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")
...
> Just to clarify: This is an issue only between GUP's software page

Not just GUP's software page table walks: any of our software page
table walks that could occur concurrently (notably, unmapping when
exiting).

> table walks when running without mmap_lock and concurrent page table
> modifications from hugepage code, correct?

Correct.

> Hardware page table walks

Have no problem: the necessary TLB flush is already done.

> and get_user_pages_fast() are fine because they properly load PTEs
> atomically and are written to assume that the page tables can change
> arbitrarily under them, and the only guarantee is that disabling
> interrupts ensures that pages referenced by PTEs can't be freed,
> right?

mm/gup.c has changed a lot since I was familiar with it, and I'm
out of touch with the history of architectural variants.  I think
internal_get_user_pages_fast() is now the place to look, and I see

		local_irq_save(flags);
		gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned);
		local_irq_restore(flags);

reassuringly there, which is how x86 always used to do it,
and the dependence of x86 TLB flush on IPIs made it all safe.

Looking at gup_pmd_range(), its operations on pmd (= READ_ONCE(*pmdp))
look correct to me, and where I said "any of our software page table
walks" above, there should be an exception for GUP_fast.

But the other software page table walks are more loosely coded, and
less able to fall back - if gup_pmd_range() catches sight of a fleeting
*pmdp 0, it rightly just gives up immediately on !pmd_present(pmd);
whereas tearing down a userspace mapping needs to wait or retry on
seeing a transient state (but mmap_lock happens to give protection
against that particular transient state).

I assume that all the architectures which support GUP_fast have now
been gathered into the same mechanism (perhaps by an otherwise
superfluous IPI on TLB flush?) and are equally safe.

Hugh
diff mbox series

Patch

diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 37794d88b1f3..a4ba0b87d6de 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -845,8 +845,6 @@  void uverbs_user_mmap_disassociate(struct ib_uverbs_file *ufile)
 		 * will only be one mm, so no big deal.
 		 */
 		mmap_read_lock(mm);
-		if (!mmget_still_valid(mm))
-			goto skip_mm;
 		mutex_lock(&ufile->umap_lock);
 		list_for_each_entry_safe (priv, next_priv, &ufile->umaps,
 					  list) {
@@ -865,7 +863,6 @@  void uverbs_user_mmap_disassociate(struct ib_uverbs_file *ufile)
 			}
 		}
 		mutex_unlock(&ufile->umap_lock);
-	skip_mm:
 		mmap_read_unlock(mm);
 		mmput(mm);
 	}
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 620465c2a1da..27f11cc7ba6c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1480,31 +1480,29 @@  static int vfio_pci_zap_and_vma_lock(struct vfio_pci_device *vdev, bool try)
 		} else {
 			mmap_read_lock(mm);
 		}
-		if (mmget_still_valid(mm)) {
-			if (try) {
-				if (!mutex_trylock(&vdev->vma_lock)) {
-					mmap_read_unlock(mm);
-					mmput(mm);
-					return 0;
-				}
-			} else {
-				mutex_lock(&vdev->vma_lock);
+		if (try) {
+			if (!mutex_trylock(&vdev->vma_lock)) {
+				mmap_read_unlock(mm);
+				mmput(mm);
+				return 0;
 			}
-			list_for_each_entry_safe(mmap_vma, tmp,
-						 &vdev->vma_list, vma_next) {
-				struct vm_area_struct *vma = mmap_vma->vma;
+		} else {
+			mutex_lock(&vdev->vma_lock);
+		}
+		list_for_each_entry_safe(mmap_vma, tmp,
+					 &vdev->vma_list, vma_next) {
+			struct vm_area_struct *vma = mmap_vma->vma;
 
-				if (vma->vm_mm != mm)
-					continue;
+			if (vma->vm_mm != mm)
+				continue;
 
-				list_del(&mmap_vma->vma_next);
-				kfree(mmap_vma);
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
 
-				zap_vma_ptes(vma, vma->vm_start,
-					     vma->vm_end - vma->vm_start);
-			}
-			mutex_unlock(&vdev->vma_lock);
+			zap_vma_ptes(vma, vma->vm_start,
+				     vma->vm_end - vma->vm_start);
 		}
+		mutex_unlock(&vdev->vma_lock);
 		mmap_read_unlock(mm);
 		mmput(mm);
 	}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5066b0251ed8..c43490aec95d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1168,24 +1168,6 @@  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					count = -EINTR;
 					goto out_mm;
 				}
-				/*
-				 * Avoid to modify vma->vm_flags
-				 * without locked ops while the
-				 * coredump reads the vm_flags.
-				 */
-				if (!mmget_still_valid(mm)) {
-					/*
-					 * Silently return "count"
-					 * like if get_task_mm()
-					 * failed. FIXME: should this
-					 * function have returned
-					 * -ESRCH if get_task_mm()
-					 * failed like if
-					 * get_proc_task() fails?
-					 */
-					mmap_write_unlock(mm);
-					goto out_mm;
-				}
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
 					vma->vm_flags &= ~VM_SOFTDIRTY;
 					vma_set_page_prot(vma);
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0e4a3837da52..000b457ad087 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -601,8 +601,6 @@  static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 
 		/* the various vma->vm_userfaultfd_ctx still points to it */
 		mmap_write_lock(mm);
-		/* no task can run (and in turn coredump) yet */
-		VM_WARN_ON(!mmget_still_valid(mm));
 		for (vma = mm->mmap; vma; vma = vma->vm_next)
 			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
 				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
@@ -842,7 +840,6 @@  static int userfaultfd_release(struct inode *inode, struct file *file)
 	/* len == 0 means wake all */
 	struct userfaultfd_wake_range range = { .len = 0, };
 	unsigned long new_flags;
-	bool still_valid;
 
 	WRITE_ONCE(ctx->released, true);
 
@@ -858,7 +855,6 @@  static int userfaultfd_release(struct inode *inode, struct file *file)
 	 * taking the mmap_lock for writing.
 	 */
 	mmap_write_lock(mm);
-	still_valid = mmget_still_valid(mm);
 	prev = NULL;
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		cond_resched();
@@ -869,17 +865,15 @@  static int userfaultfd_release(struct inode *inode, struct file *file)
 			continue;
 		}
 		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
-		if (still_valid) {
-			prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
-					 new_flags, vma->anon_vma,
-					 vma->vm_file, vma->vm_pgoff,
-					 vma_policy(vma),
-					 NULL_VM_UFFD_CTX);
-			if (prev)
-				vma = prev;
-			else
-				prev = vma;
-		}
+		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+				 new_flags, vma->anon_vma,
+				 vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_UFFD_CTX);
+		if (prev)
+			vma = prev;
+		else
+			prev = vma;
 		vma->vm_flags = new_flags;
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 	}
@@ -1309,8 +1303,6 @@  static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	mmap_write_lock(mm);
-	if (!mmget_still_valid(mm))
-		goto out_unlock;
 	vma = find_vma_prev(mm, start, &prev);
 	if (!vma)
 		goto out_unlock;
@@ -1511,8 +1503,6 @@  static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	mmap_write_lock(mm);
-	if (!mmget_still_valid(mm))
-		goto out_unlock;
 	vma = find_vma_prev(mm, start, &prev);
 	if (!vma)
 		goto out_unlock;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index f889e332912f..e9cd1e637d76 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,31 +49,6 @@  static inline void mmdrop(struct mm_struct *mm)
 		__mmdrop(mm);
 }
 
-/*
- * This has to be called after a get_task_mm()/mmget_not_zero()
- * followed by taking the mmap_lock for writing before modifying the
- * vmas or anything the coredump pretends not to change from under it.
- *
- * It also has to be called when mmgrab() is used in the context of
- * the process, but then the mm_count refcount is transferred outside
- * the context of the process to run down_write() on that pinned mm.
- *
- * NOTE: find_extend_vma() called from GUP context is the only place
- * that can modify the "mm" (notably the vm_start/end) under mmap_lock
- * for reading and outside the context of the process, so it is also
- * the only case that holds the mmap_lock for reading that must call
- * this function. Generally if the mmap_lock is hold for reading
- * there's no need of this check after get_task_mm()/mmget_not_zero().
- *
- * This function can be obsoleted and the check can be removed, after
- * the coredump code will hold the mmap_lock for writing before
- * invoking the ->core_dump methods.
- */
-static inline bool mmget_still_valid(struct mm_struct *mm)
-{
-	return likely(!mm->core_state);
-}
-
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15a9af791014..101b636c72b5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -431,7 +431,7 @@  static void insert_to_mm_slots_hash(struct mm_struct *mm,
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
+	return atomic_read(&mm->mm_users) == 0;
 }
 
 static bool hugepage_vma_check(struct vm_area_struct *vma,
diff --git a/mm/madvise.c b/mm/madvise.c
index dd1d43cf026d..d5b33d9011f0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1091,23 +1091,6 @@  int do_madvise(unsigned long start, size_t len_in, int behavior)
 	if (write) {
 		if (mmap_write_lock_killable(current->mm))
 			return -EINTR;
-
-		/*
-		 * We may have stolen the mm from another process
-		 * that is undergoing core dumping.
-		 *
-		 * Right now that's io_ring, in the future it may
-		 * be remote process management and not "current"
-		 * at all.
-		 *
-		 * We need to fix core dumping to not do this,
-		 * but for now we have the mmget_still_valid()
-		 * model.
-		 */
-		if (!mmget_still_valid(current->mm)) {
-			mmap_write_unlock(current->mm);
-			return -EINTR;
-		}
 	} else {
 		mmap_read_lock(current->mm);
 	}
diff --git a/mm/mmap.c b/mm/mmap.c
index 40248d84ad5f..c47abe460439 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2552,7 +2552,7 @@  find_extend_vma(struct mm_struct *mm, unsigned long addr)
 	if (vma && (vma->vm_start <= addr))
 		return vma;
 	/* don't alter vm_end if the coredump is running */
-	if (!prev || !mmget_still_valid(mm) || expand_stack(prev, addr))
+	if (!prev || expand_stack(prev, addr))
 		return NULL;
 	if (prev->vm_flags & VM_LOCKED)
 		populate_vma_page_range(prev, addr, prev->vm_end, NULL);
@@ -2578,9 +2578,6 @@  find_extend_vma(struct mm_struct *mm, unsigned long addr)
 		return vma;
 	if (!(vma->vm_flags & VM_GROWSDOWN))
 		return NULL;
-	/* don't alter vm_start if the coredump is running */
-	if (!mmget_still_valid(mm))
-		return NULL;
 	start = vma->vm_start;
 	if (expand_stack(vma, addr))
 		return NULL;