
[v5,07/27] mm/mmap: Create a guard area between VMAs

Message ID 20181011151523.27101-8-yu-cheng.yu@intel.com (mailing list archive)
State New, archived
Series Control Flow Enforcement: Shadow Stack

Commit Message

Yu-cheng Yu Oct. 11, 2018, 3:15 p.m. UTC
Create a guard area between VMAs to detect memory corruption.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 include/linux/mm.h | 30 ++++++++++++++++++++----------
 mm/Kconfig         |  7 +++++++
 2 files changed, 27 insertions(+), 10 deletions(-)

Comments

Jann Horn Oct. 11, 2018, 8:39 p.m. UTC | #1
On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> Create a guard area between VMAs to detect memory corruption.
[...]
> +config VM_AREA_GUARD
> +       bool "VM area guard"
> +       default n
> +       help
> +         Create a guard area between VM areas so that access beyond
> +         limit can be detected.
> +
>  endmenu

Sorry to bring this up so late, but Daniel Micay pointed out to me
that, given that VMA guards will raise the number of VMAs by
inhibiting vma_merge(), people are more likely to run into
/proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
default, and can't easily be raised without risking an overflow of
page->_mapcount on systems with over ~800GiB of RAM, see
https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
and replies) with this change.

Playing with glibc's memory allocator, it looks like glibc will use
mmap() for 128KB allocations; so at 65530*128KB=8GB of memory usage in
128KB chunks, an application could run out of VMAs.
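
As a rough illustration of that arithmetic, here is a small user-space
sketch (my own, not something from the thread). It keeps mapping 128 KiB
anonymous chunks and alternates page protections so that vma_merge()
cannot coalesce neighbouring mappings, which stands in for the
merge-inhibiting effect of the proposed guard area, until mmap() fails
once vm.max_map_count is exhausted.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t chunk = 128 * 1024;
	unsigned long count = 0;

	for (;;) {
		void *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;
		/* Alternate protections so adjacent anonymous mappings
		 * keep separate VMAs instead of being merged. */
		if (count & 1)
			mprotect(p, chunk, PROT_READ);
		count++;
	}
	printf("%lu mappings (~%lu MiB of address space) before failure\n",
	       count, count * (unsigned long)chunk >> 20);
	return 0;
}

With the default limit of 65530 VMAs this stops after roughly 8 GiB of
address space, matching the 65530*128KB figure above.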

People already run into that limit sometimes when mapping files, and
recommend raising it:

https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
http://docs.actian.com/vector/4.2/User/Increase_max_map_count_Kernel_Parameter_(Linux).htm
https://www.suse.com/de-de/support/kb/doc/?id=7000830 (they actually
ran into ENOMEM on **munmap**, because you can't split VMAs once the
limit is reached): "A custom application was failing on a SLES server
with ENOMEM errors when attempting to release memory using an munmap
call. This resulted in memory failing to be released, and the system
load and swap use increasing until the SLES machine ultimately crashed
or hung."
https://access.redhat.com/solutions/99913
https://forum.manjaro.org/t/resolved-how-to-set-vm-max-map-count-during-boot/43360

Arguably the proper solution to this would be to raise the default
max_map_count to be much higher; but then that requires fixing the
mapcount overflow.
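
For a sense of scale, a deliberately rough, assumption-laden estimate
(mine, not a figure from the linked thread): page->_mapcount is a 32-bit
counter, so it wraps after about 2^31 mappings of a single page, and each
mapping needs at least an 8-byte PTE plus a vm_area_struct assumed here
to be about 200 bytes.

#include <stdio.h>

int main(void)
{
	/* Back-of-the-envelope only; the per-mapping cost is an assumption. */
	unsigned long mappings = 1UL << 31;		/* wrap a 32-bit count */
	unsigned long bytes_per_mapping = 8 + 200;	/* PTE + vm_area_struct */
	unsigned long procs = mappings / 65530;		/* default max_map_count */

	printf("~%lu GiB of kernel structures across ~%lu processes\n",
	       mappings * bytes_per_mapping >> 30, procs);
	return 0;
}

That lands in the hundreds-of-GiB range, consistent with the ~800GiB
figure above; raising max_map_count mainly removes the need for tens of
thousands of cooperating processes.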
Yu-cheng Yu Oct. 11, 2018, 8:49 p.m. UTC | #2
On Thu, 2018-10-11 at 22:39 +0200, Jann Horn wrote:
> On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > Create a guard area between VMAs to detect memory corruption.
> 
> [...]
> > +config VM_AREA_GUARD
> > +       bool "VM area guard"
> > +       default n
> > +       help
> > +         Create a guard area between VM areas so that access beyond
> > +         limit can be detected.
> > +
> >  endmenu
> 
> Sorry to bring this up so late, but Daniel Micay pointed out to me
> that, given that VMA guards will raise the number of VMAs by
> inhibiting vma_merge(), people are more likely to run into
> /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> default, and can't easily be raised without risking an overflow of
> page->_mapcount on systems with over ~800GiB of RAM, see
> https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> and replies) with this change.

Can we use the VMA guard only for Shadow Stacks?

Yu-cheng
Dave Hansen Oct. 11, 2018, 8:49 p.m. UTC | #3
On 10/11/2018 08:15 AM, Yu-cheng Yu wrote:
> Create a guard area between VMAs to detect memory corruption.

This is a pretty major change that has a bunch of end-user implications.
 It's not dependent on any debugging options and can't be turned on/off
by individual apps, at runtime, or even at boot.

Its connection to this series is also tenuous and not spelled out in the
exceptionally terse changelog.
Andy Lutomirski Oct. 11, 2018, 8:55 p.m. UTC | #4
On Thu, Oct 11, 2018 at 1:39 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > Create a guard area between VMAs to detect memory corruption.
> [...]
> > +config VM_AREA_GUARD
> > +       bool "VM area guard"
> > +       default n
> > +       help
> > +         Create a guard area between VM areas so that access beyond
> > +         limit can be detected.
> > +
> >  endmenu
>
> Sorry to bring this up so late, but Daniel Micay pointed out to me
> that, given that VMA guards will raise the number of VMAs by
> inhibiting vma_merge(), people are more likely to run into
> /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> default, and can't easily be raised without risking an overflow of
> page->_mapcount on systems with over ~800GiB of RAM, see
> https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> and replies) with this change.
>
> Playing with glibc's memory allocator, it looks like glibc will use
> mmap() for 128KB allocations; so at 65530*128KB=8GB of memory usage in
> 128KB chunks, an application could run out of VMAs.

Ugh.

Do we have a free VM flag so we could do VM_GUARD to force a guard
page?  (And to make sure that, when a new VMA is allocated, it won't
be directly adjacent to a VM_GUARD VMA.)
Florian Weimer Oct. 12, 2018, 10:24 a.m. UTC | #5
* Dave Hansen:

> On 10/11/2018 08:15 AM, Yu-cheng Yu wrote:
>> Create a guard area between VMAs to detect memory corruption.
>
> This is a pretty major change that has a bunch of end-user implications.
>  It's not dependent on any debugging options and can't be turned on/off
> by individual apps, at runtime, or even at boot.
>
> Its connection to this series is also tenuous and not spelled out in the
> exceptionally terse changelog.

I agree.  We did have application failures due to the introduction of
the stack gap, so this change is likely to cause failures when applied
to existing mappings as well.
Matthew Wilcox (Oracle) Oct. 12, 2018, 1:17 p.m. UTC | #6
On Thu, Oct 11, 2018 at 10:39:24PM +0200, Jann Horn wrote:
> Sorry to bring this up so late, but Daniel Micay pointed out to me
> that, given that VMA guards will raise the number of VMAs by
> inhibiting vma_merge(), people are more likely to run into
> /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> default, and can't easily be raised without risking an overflow of
> page->_mapcount on systems with over ~800GiB of RAM, see
> https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> and replies) with this change.
> 
[...]
> 
> Arguably the proper solution to this would be to raise the default
> max_map_count to be much higher; but then that requires fixing the
> mapcount overflow.

I have a fix that nobody has any particular reaction to:

diff --git a/mm/internal.h b/mm/internal.h
index 7059a8389194..977852b8329e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,11 @@ extern void putback_lru_page(struct page *page);
  */
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
+#ifdef CONFIG_64BIT
+extern void mm_mapcount_overflow(struct page *page);
+#else
+static inline void mm_mapcount_overflow(struct page *page) { }
+#endif
 /*
  * in mm/page_alloc.c
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021ad22..575766ec02f8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1315,6 +1315,115 @@ static inline int mlock_future_check(struct mm_struct *mm,
 	return 0;
 }
 
+#ifdef CONFIG_64BIT
+/*
+ * Machines with more than 2TB of memory can create enough VMAs to overflow
+ * page->_mapcount if they all point to the same page.  32-bit machines do
+ * not need to be concerned.
+ */
+/*
+ * Experimentally determined.  gnome-shell currently uses fewer than
+ * 3000 mappings, so should have zero effect on desktop users.
+ */
+#define mm_track_threshold	5000
+static DEFINE_SPINLOCK(heavy_users_lock);
+static DEFINE_IDR(heavy_users);
+
+static void mmap_track_user(struct mm_struct *mm, int max)
+{
+	struct mm_struct *entry;
+	unsigned int id;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock(&heavy_users_lock);
+	idr_for_each_entry(&heavy_users, entry, id) {
+		if (entry == mm)
+			break;
+		if (entry->map_count < mm_track_threshold)
+			idr_remove(&heavy_users, id);
+	}
+	if (!entry)
+		idr_alloc(&heavy_users, mm, 0, 0, GFP_ATOMIC);
+	spin_unlock(&heavy_users_lock);
+}
+
+static void mmap_untrack_user(struct mm_struct *mm)
+{
+	struct mm_struct *entry;
+	unsigned int id;
+
+	spin_lock(&heavy_users_lock);
+	idr_for_each_entry(&heavy_users, entry, id) {
+		if (entry == mm) {
+			idr_remove(&heavy_users, id);
+			break;
+		}
+	}
+	spin_unlock(&heavy_users_lock);
+}
+
+static void kill_mm(struct task_struct *tsk)
+{
+	/* Tear down the mappings first */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, tsk, true);
+}
+
+static void kill_abuser(struct mm_struct *mm)
+{
+	struct task_struct *tsk;
+
+	for_each_process(tsk)
+		if (tsk->mm == mm)
+			break;
+
+	if (down_write_trylock(&mm->mmap_sem)) {
+		kill_mm(tsk);
+		up_write(&mm->mmap_sem);
+	} else {
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, tsk, true);
+	}
+}
+
+void mm_mapcount_overflow(struct page *page)
+{
+	struct mm_struct *entry = current->mm;
+	unsigned int id;
+	struct vm_area_struct *vma;
+	struct address_space *mapping = page_mapping(page);
+	unsigned long pgoff = page_to_pgoff(page);
+	unsigned int count = 0;
+
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + 1) {
+		if (vma->vm_mm == entry)
+			count++;
+		if (count > 1000)
+			kill_mm(current);
+	}
+
+	rcu_read_lock();
+	idr_for_each_entry(&heavy_users, entry, id) {
+		count = 0;
+
+		vma_interval_tree_foreach(vma, &mapping->i_mmap,
+				pgoff, pgoff + 1) {
+			if (vma->vm_mm == entry)
+				count++;
+			if (count > 1000) {
+				kill_abuser(entry);
+				goto out;
+			}
+		}
+	}
+	if (!entry)
+		panic("No abusers found but mapcount exceeded\n");
+out:
+	rcu_read_unlock();
+}
+#else
+static void mmap_track_user(struct mm_struct *mm, int max) { }
+static void mmap_untrack_user(struct mm_struct *mm) { }
+#endif
+
 /*
  * The caller must hold down_write(&current->mm->mmap_sem).
  */
@@ -1357,6 +1466,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	/* Too many mappings? */
 	if (mm->map_count > sysctl_max_map_count)
 		return -ENOMEM;
+	if (mm->map_count > mm_track_threshold)
+		mmap_track_user(mm, mm_track_threshold);
 
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
@@ -2997,6 +3108,8 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	mmap_untrack_user(mm);
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 47db27f8049e..d88acf5c98e9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1190,6 +1190,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
+		int v;
 		if (PageTransCompound(page) && page_mapping(page)) {
 			VM_WARN_ON_ONCE(!PageLocked(page));
 
@@ -1197,8 +1198,13 @@ void page_add_file_rmap(struct page *page, bool compound)
 			if (PageMlocked(page))
 				clear_page_mlock(compound_head(page));
 		}
-		if (!atomic_inc_and_test(&page->_mapcount))
+		v = atomic_inc_return(&page->_mapcount);
+		if (likely(v > 0))
 			goto out;
+		if (unlikely(v < 0)) {
+			mm_mapcount_overflow(page);
+			goto out;
+		}
 	}
 	__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
 out:
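
Condensing the moving parts of the fix above (my paraphrase of the patch,
not extra patch content), as a comment block:

/*
 * do_mmap():            map_count > 5000  ->  mmap_track_user(mm)
 * exit_mmap():                                mmap_untrack_user(mm)
 * page_add_file_rmap(): _mapcount wraps   ->  mm_mapcount_overflow(page)
 * mm_mapcount_overflow(): walk the page's i_mmap interval tree; if any
 *   mm (current first, then the tracked heavy users) holds more than
 *   1000 mappings of the page, SIGKILL it; panic if no abuser is found.
 */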
Yu-cheng Yu Oct. 12, 2018, 9:49 p.m. UTC | #7
On Thu, 2018-10-11 at 13:55 -0700, Andy Lutomirski wrote:
> On Thu, Oct 11, 2018 at 1:39 PM Jann Horn <jannh@google.com> wrote:
> > 
> > On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > Create a guard area between VMAs to detect memory corruption.
> > 
> > [...]
> > > +config VM_AREA_GUARD
> > > +       bool "VM area guard"
> > > +       default n
> > > +       help
> > > +         Create a guard area between VM areas so that access beyond
> > > +         limit can be detected.
> > > +
> > >  endmenu
> > 
> > Sorry to bring this up so late, but Daniel Micay pointed out to me
> > that, given that VMA guards will raise the number of VMAs by
> > inhibiting vma_merge(), people are more likely to run into
> > /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> > default, and can't easily be raised without risking an overflow of
> > page->_mapcount on systems with over ~800GiB of RAM, see
> > https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> > and replies) with this change.
> > 
> > Playing with glibc's memory allocator, it looks like glibc will use
> > mmap() for 128KB allocations; so at 65530*128KB=8GB of memory usage in
> > 128KB chunks, an application could run out of VMAs.
> 
> Ugh.
> 
> Do we have a free VM flag so we could do VM_GUARD to force a guard
> page?  (And to make sure that, when a new VMA is allocated, it won't
> be directly adjacent to a VM_GUARD VMA.)

Maybe something like the following?  These vm_start_gap()/vm_end_gap() are used
in many architectures.  Do we want to put them in a different series?  Comments?

Yu-cheng




diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..92b580542411 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -224,11 +224,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -266,6 +268,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_GUARD	VM_HIGH_ARCH_5
+#else
+#define VM_GUARD	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
@@ -2417,24 +2425,34 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
-static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
+static inline unsigned long vm_start_gap(struct vm_area_struct *vma, vm_flags_t flags)
 {
 	unsigned long vm_start = vma->vm_start;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSDOWN)
+		gap = stack_guard_gap;
+	else if ((vma->vm_flags & VM_GUARD) || (flags & VM_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
 	return vm_start;
 }
 
-static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
+static inline unsigned long vm_end_gap(struct vm_area_struct *vma, vm_flags_t flags)
 {
 	unsigned long vm_end = vma->vm_end;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSUP)
+		gap = stack_guard_gap;
+	else if ((vma->vm_flags & VM_GUARD) || (flags & VM_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_end += gap;
+	if (vm_end < vma->vm_end)
+		vm_end = -PAGE_SIZE;
 
-	if (vma->vm_flags & VM_GROWSUP) {
-		vm_end += stack_guard_gap;
-		if (vm_end < vma->vm_end)
-			vm_end = -PAGE_SIZE;
-	}
 	return vm_end;
 }

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..53cfc104c0fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2417,24 +2417,34 @@  static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_start = vma->vm_start;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSDOWN)
+		gap = stack_guard_gap;
+	else if (IS_ENABLED(CONFIG_VM_AREA_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
 
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
 	return vm_start;
 }
 
 static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_end = vma->vm_end;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSUP)
+		gap = stack_guard_gap;
+	else if (IS_ENABLED(CONFIG_VM_AREA_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_end += gap;
+	if (vm_end < vma->vm_end)
+		vm_end = -PAGE_SIZE;
 
-	if (vma->vm_flags & VM_GROWSUP) {
-		vm_end += stack_guard_gap;
-		if (vm_end < vma->vm_end)
-			vm_end = -PAGE_SIZE;
-	}
 	return vm_end;
 }
 
diff --git a/mm/Kconfig b/mm/Kconfig
index de64ea658716..0cdcad65640d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -764,4 +764,11 @@  config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+config VM_AREA_GUARD
+	bool "VM area guard"
+	default n
+	help
+	  Create a guard area between VM areas so that access beyond
+	  limit can be detected.
+
 endmenu
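
To show what the option changes from a process's point of view, here is a
hypothetical user-space demo (not part of the patch, and the expected
numbers are an assumption about placement): with CONFIG_VM_AREA_GUARD=y
consecutive anonymous mappings should be separated by at least one
unmapped page, so an access just past the end of one mapping faults
instead of silently landing in its neighbour.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 * 1024;
	char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (a == MAP_FAILED || b == MAP_FAILED)
		return 1;

	/* Expect a hole of 0 bytes on a stock kernel (the mappings are
	 * adjacent, possibly merged into one VMA) and at least PAGE_SIZE
	 * with the guard area enabled. */
	printf("a=%p b=%p hole=%ld bytes\n", (void *)a, (void *)b,
	       (long)(a > b ? a - (b + len) : b - (a + len)));
	return 0;
}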