Message ID: 20181011151523.27101-8-yu-cheng.yu@intel.com
State:      New, archived
Series:     Control Flow Enforcement: Shadow Stack
On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> Create a guard area between VMAs to detect memory corruption.
[...]
> +config VM_AREA_GUARD
> +	bool "VM area guard"
> +	default n
> +	help
> +	  Create a guard area between VM areas so that access beyond
> +	  limit can be detected.
> +
>  endmenu

Sorry to bring this up so late, but Daniel Micay pointed out to me
that, given that VMA guards will raise the number of VMAs by
inhibiting vma_merge(), people are more likely to run into
/proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
default, and can't easily be raised without risking an overflow of
page->_mapcount on systems with over ~800GiB of RAM, see
https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
and replies) with this change.

Playing with glibc's memory allocator, it looks like glibc will use
mmap() for 128KB allocations; so at 65530*128KB=8GB of memory usage in
128KB chunks, an application could run out of VMAs.

People already run into that limit sometimes when mapping files, and
recommend raising it:

https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
http://docs.actian.com/vector/4.2/User/Increase_max_map_count_Kernel_Parameter_(Linux).htm
https://www.suse.com/de-de/support/kb/doc/?id=7000830 (they actually
ran into ENOMEM on **munmap**, because you can't split VMAs once the
limit is reached): "A custom application was failing on a SLES server
with ENOMEM errors when attempting to release memory using an munmap
call. This resulted in memory failing to be released, and the system
load and swap use increasing until the SLES machine ultimately crashed
or hung."
https://access.redhat.com/solutions/99913
https://forum.manjaro.org/t/resolved-how-to-set-vm-max-map-count-during-boot/43360

Arguably the proper solution to this would be to raise the default
max_map_count to be much higher; but then that requires fixing the
mapcount overflow.
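The arithmetic above is easy to reproduce from user space. The following
stand-alone test program is illustrative only (it is not part of the
thread): it keeps creating 128 KiB anonymous mappings with alternating
protections, so the resulting VMAs cannot be merged, which is roughly what
a per-VMA guard area would cause for every mapping. With the default
/proc/sys/vm/max_map_count of 65530 the loop ends with ENOMEM after ~65k
chunks, i.e. about 8 GiB of address space:

/*
 * Illustrative only: run out of VMAs with 128 KiB anonymous mappings.
 * Protections alternate so adjacent VMAs cannot be merged by vma_merge().
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	const size_t chunk = 128 * 1024;
	size_t n = 0;

	for (;;) {
		/* Alternating PROT_READ / PROT_READ|PROT_WRITE defeats merging. */
		int prot = (n & 1) ? PROT_READ : PROT_READ | PROT_WRITE;
		void *p = mmap(NULL, chunk, prot,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			printf("mmap #%zu failed: %s (%zu MiB in 128 KiB chunks)\n",
			       n + 1, strerror(errno), n * chunk >> 20);
			return 0;
		}
		n++;
	}
}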
On Thu, 2018-10-11 at 22:39 +0200, Jann Horn wrote:
> On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > Create a guard area between VMAs to detect memory corruption.
>
> [...]
> > +config VM_AREA_GUARD
> > +	bool "VM area guard"
> > +	default n
> > +	help
> > +	  Create a guard area between VM areas so that access beyond
> > +	  limit can be detected.
> > +
> >  endmenu
>
> Sorry to bring this up so late, but Daniel Micay pointed out to me
> that, given that VMA guards will raise the number of VMAs by
> inhibiting vma_merge(), people are more likely to run into
> /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> default, and can't easily be raised without risking an overflow of
> page->_mapcount on systems with over ~800GiB of RAM, see
> https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> and replies) with this change.

Can we use the VMA guard only for Shadow Stacks?

Yu-cheng
On 10/11/2018 08:15 AM, Yu-cheng Yu wrote:
> Create a guard area between VMAs to detect memory corruption.
This is a pretty major change that has a bunch of end-user implications.
It's not dependent on any debugging options and can't be turned on/off
by individual apps, at runtime, or even at boot.
Its connection to this series is also tenuous and not spelled out in the
exceptionally terse changelog.
On Thu, Oct 11, 2018 at 1:39 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > Create a guard area between VMAs to detect memory corruption.
> [...]
> > +config VM_AREA_GUARD
> > +	bool "VM area guard"
> > +	default n
> > +	help
> > +	  Create a guard area between VM areas so that access beyond
> > +	  limit can be detected.
> > +
> >  endmenu
>
> Sorry to bring this up so late, but Daniel Micay pointed out to me
> that, given that VMA guards will raise the number of VMAs by
> inhibiting vma_merge(), people are more likely to run into
> /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> default, and can't easily be raised without risking an overflow of
> page->_mapcount on systems with over ~800GiB of RAM, see
> https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> and replies) with this change.
>
> Playing with glibc's memory allocator, it looks like glibc will use
> mmap() for 128KB allocations; so at 65530*128KB=8GB of memory usage in
> 128KB chunks, an application could run out of VMAs.

Ugh.

Do we have a free VM flag so we could do VM_GUARD to force a guard
page?  (And to make sure that, when a new VMA is allocated, it won't
be directly adjacent to a VM_GUARD VMA.)
* Dave Hansen:

> On 10/11/2018 08:15 AM, Yu-cheng Yu wrote:
>> Create a guard area between VMAs to detect memory corruption.
>
> This is a pretty major change that has a bunch of end-user implications.
> It's not dependent on any debugging options and can't be turned on/off
> by individual apps, at runtime, or even at boot.
>
> Its connection to this series is also tenuous and not spelled out in the
> exceptionally terse changelog.

I agree.  We did have application failures due to the introduction of
the stack gap, so this change is likely to cause failures when applied
to existing mappings as well.
On Thu, Oct 11, 2018 at 10:39:24PM +0200, Jann Horn wrote:
> Sorry to bring this up so late, but Daniel Micay pointed out to me
> that, given that VMA guards will raise the number of VMAs by
> inhibiting vma_merge(), people are more likely to run into
> /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> default, and can't easily be raised without risking an overflow of
> page->_mapcount on systems with over ~800GiB of RAM, see
> https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> and replies) with this change.
> [...]
>
> Arguably the proper solution to this would be to raise the default
> max_map_count to be much higher; but then that requires fixing the
> mapcount overflow.

I have a fix that nobody has any particular reaction to:

diff --git a/mm/internal.h b/mm/internal.h
index 7059a8389194..977852b8329e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,11 @@ extern void putback_lru_page(struct page *page);
  */
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
+#ifdef CONFIG_64BIT
+extern void mm_mapcount_overflow(struct page *page);
+#else
+static inline void mm_mapcount_overflow(struct page *page) { }
+#endif
 /*
  * in mm/page_alloc.c
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021ad22..575766ec02f8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1315,6 +1315,115 @@ static inline int mlock_future_check(struct mm_struct *mm,
 	return 0;
 }
 
+#ifdef CONFIG_64BIT
+/*
+ * Machines with more than 2TB of memory can create enough VMAs to overflow
+ * page->_mapcount if they all point to the same page.  32-bit machines do
+ * not need to be concerned.
+ */
+/*
+ * Experimentally determined.  gnome-shell currently uses fewer than
+ * 3000 mappings, so should have zero effect on desktop users.
+ */
+#define mm_track_threshold	5000
+static DEFINE_SPINLOCK(heavy_users_lock);
+static DEFINE_IDR(heavy_users);
+
+static void mmap_track_user(struct mm_struct *mm, int max)
+{
+	struct mm_struct *entry;
+	unsigned int id;
+
+	idr_preload(GFP_KERNEL);
+	spin_lock(&heavy_users_lock);
+	idr_for_each_entry(&heavy_users, entry, id) {
+		if (entry == mm)
+			break;
+		if (entry->map_count < mm_track_threshold)
+			idr_remove(&heavy_users, id);
+	}
+	if (!entry)
+		idr_alloc(&heavy_users, mm, 0, 0, GFP_ATOMIC);
+	spin_unlock(&heavy_users_lock);
+}
+
+static void mmap_untrack_user(struct mm_struct *mm)
+{
+	struct mm_struct *entry;
+	unsigned int id;
+
+	spin_lock(&heavy_users_lock);
+	idr_for_each_entry(&heavy_users, entry, id) {
+		if (entry == mm) {
+			idr_remove(&heavy_users, id);
+			break;
+		}
+	}
+	spin_unlock(&heavy_users_lock);
+}
+
+static void kill_mm(struct task_struct *tsk)
+{
+	/* Tear down the mappings first */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, tsk, true);
+}
+
+static void kill_abuser(struct mm_struct *mm)
+{
+	struct task_struct *tsk;
+
+	for_each_process(tsk)
+		if (tsk->mm == mm)
+			break;
+
+	if (down_write_trylock(&mm->mmap_sem)) {
+		kill_mm(tsk);
+		up_write(&mm->mmap_sem);
+	} else {
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, tsk, true);
+	}
+}
+
+void mm_mapcount_overflow(struct page *page)
+{
+	struct mm_struct *entry = current->mm;
+	unsigned int id;
+	struct vm_area_struct *vma;
+	struct address_space *mapping = page_mapping(page);
+	unsigned long pgoff = page_to_pgoff(page);
+	unsigned int count = 0;
+
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + 1) {
+		if (vma->vm_mm == entry)
+			count++;
+		if (count > 1000)
+			kill_mm(current);
+	}
+
+	rcu_read_lock();
+	idr_for_each_entry(&heavy_users, entry, id) {
+		count = 0;
+
+		vma_interval_tree_foreach(vma, &mapping->i_mmap,
				pgoff, pgoff + 1) {
+			if (vma->vm_mm == entry)
+				count++;
+			if (count > 1000) {
+				kill_abuser(entry);
+				goto out;
+			}
+		}
+	}
+	if (!entry)
+		panic("No abusers found but mapcount exceeded\n");
+out:
+	rcu_read_unlock();
+}
+#else
+static void mmap_track_user(struct mm_struct *mm, int max) { }
+static void mmap_untrack_user(struct mm_struct *mm) { }
+#endif
+
 /*
  * The caller must hold down_write(&current->mm->mmap_sem).
  */
@@ -1357,6 +1466,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	/* Too many mappings? */
 	if (mm->map_count > sysctl_max_map_count)
 		return -ENOMEM;
+	if (mm->map_count > mm_track_threshold)
+		mmap_track_user(mm, mm_track_threshold);
 
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
@@ -2997,6 +3108,8 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	mmap_untrack_user(mm);
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 47db27f8049e..d88acf5c98e9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1190,6 +1190,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
 	} else {
+		int v;
 		if (PageTransCompound(page) && page_mapping(page)) {
 			VM_WARN_ON_ONCE(!PageLocked(page));
 
@@ -1197,8 +1198,13 @@ void page_add_file_rmap(struct page *page, bool compound)
 			if (PageMlocked(page))
 				clear_page_mlock(compound_head(page));
 		}
-		if (!atomic_inc_and_test(&page->_mapcount))
+		v = atomic_inc_return(&page->_mapcount);
+		if (likely(v > 0))
 			goto out;
+		if (unlikely(v < 0)) {
+			mm_mapcount_overflow(page);
+			goto out;
+		}
 	}
 	__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
 out:
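For reference, the rmap hunk above relies on page->_mapcount starting at
-1: the first mapping takes it to 0, and a wrap past INT_MAX shows up as a
negative result of the increment, which is what triggers
mm_mapcount_overflow(). A small user-space model of that check
(illustrative only; the kernel helpers are of course not available here)
looks like this:

/*
 * Illustrative user-space model of the overflow check added to
 * page_add_file_rmap() above.
 */
#include <limits.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int mapcount = -1;	/* like a freshly allocated page */

/* Like the kernel's atomic_inc_return(): returns the new value.  The +1 is
 * done in unsigned arithmetic so the deliberate wrap avoids signed-overflow
 * undefined behaviour in plain C. */
static int inc_return(atomic_int *v)
{
	return (int)((unsigned int)atomic_fetch_add(v, 1) + 1u);
}

static void page_add_file_rmap_model(void)
{
	int v = inc_return(&mapcount);

	if (v > 0) {
		puts("already mapped, nothing more to account");
		return;
	}
	if (v < 0) {
		puts("mapcount overflow -> mm_mapcount_overflow()");
		return;
	}
	puts("first mapping -> bump NR_FILE_MAPPED");
}

int main(void)
{
	page_add_file_rmap_model();		/* -1 -> 0: first mapping */
	atomic_store(&mapcount, INT_MAX);	/* pretend ~2^31 mappings exist */
	page_add_file_rmap_model();		/* wraps negative: overflow path */
	return 0;
}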
On Thu, 2018-10-11 at 13:55 -0700, Andy Lutomirski wrote:
> On Thu, Oct 11, 2018 at 1:39 PM Jann Horn <jannh@google.com> wrote:
> >
> > On Thu, Oct 11, 2018 at 5:20 PM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > Create a guard area between VMAs to detect memory corruption.
> >
> > [...]
> > > +config VM_AREA_GUARD
> > > +	bool "VM area guard"
> > > +	default n
> > > +	help
> > > +	  Create a guard area between VM areas so that access beyond
> > > +	  limit can be detected.
> > > +
> > >  endmenu
> >
> > Sorry to bring this up so late, but Daniel Micay pointed out to me
> > that, given that VMA guards will raise the number of VMAs by
> > inhibiting vma_merge(), people are more likely to run into
> > /proc/sys/vm/max_map_count (which limits the number of VMAs to ~65k by
> > default, and can't easily be raised without risking an overflow of
> > page->_mapcount on systems with over ~800GiB of RAM, see
> > https://lore.kernel.org/lkml/20180208021112.GB14918@bombadil.infradead.org/
> > and replies) with this change.
> >
> > Playing with glibc's memory allocator, it looks like glibc will use
> > mmap() for 128KB allocations; so at 65530*128KB=8GB of memory usage in
> > 128KB chunks, an application could run out of VMAs.
>
> Ugh.
>
> Do we have a free VM flag so we could do VM_GUARD to force a guard
> page?  (And to make sure that, when a new VMA is allocated, it won't
> be directly adjacent to a VM_GUARD VMA.)

Maybe something like the following?  These vm_start_gap()/vm_end_gap()
are used in many architectures.  Do we want to put them in a different
series?

Comments?

Yu-cheng

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..92b580542411 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -224,11 +224,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -266,6 +268,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MPX		VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_GUARD	VM_HIGH_ARCH_5
+#else
+#define VM_GUARD	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
@@ -2417,24 +2425,34 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
 
-static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
+static inline unsigned long vm_start_gap(struct vm_area_struct *vma, vm_flags_t flags)
 {
 	unsigned long vm_start = vma->vm_start;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSDOWN)
+		gap = stack_guard_gap;
+	else if ((vma->vm_flags & VM_GUARD) || (flags & VM_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
 	return vm_start;
 }
 
-static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
+static inline unsigned long vm_end_gap(struct vm_area_struct *vma, vm_flags_t flags)
 {
 	unsigned long vm_end = vma->vm_end;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSUP)
+		gap = stack_guard_gap;
+	else if ((vma->vm_flags & VM_GUARD) || (flags & VM_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_end += gap;
+	if (vm_end < vma->vm_end)
+		vm_end = -PAGE_SIZE;
-	if (vma->vm_flags & VM_GROWSUP) {
-		vm_end += stack_guard_gap;
-		if (vm_end < vma->vm_end)
-			vm_end = -PAGE_SIZE;
-	}
 	return vm_end;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..53cfc104c0fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2417,24 +2417,34 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_start = vma->vm_start;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSDOWN)
+		gap = stack_guard_gap;
+	else if (IS_ENABLED(CONFIG_VM_AREA_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_start -= gap;
+	if (vm_start > vma->vm_start)
+		vm_start = 0;
-	if (vma->vm_flags & VM_GROWSDOWN) {
-		vm_start -= stack_guard_gap;
-		if (vm_start > vma->vm_start)
-			vm_start = 0;
-	}
 	return vm_start;
 }
 
 static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
 {
 	unsigned long vm_end = vma->vm_end;
+	unsigned long gap = 0;
+
+	if (vma->vm_flags & VM_GROWSUP)
+		gap = stack_guard_gap;
+	else if (IS_ENABLED(CONFIG_VM_AREA_GUARD))
+		gap = PAGE_SIZE;
+
+	vm_end += gap;
+	if (vm_end < vma->vm_end)
+		vm_end = -PAGE_SIZE;
-	if (vma->vm_flags & VM_GROWSUP) {
-		vm_end += stack_guard_gap;
-		if (vm_end < vma->vm_end)
-			vm_end = -PAGE_SIZE;
-	}
 	return vm_end;
 }
diff --git a/mm/Kconfig b/mm/Kconfig
index de64ea658716..0cdcad65640d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -764,4 +764,11 @@ config GUP_BENCHMARK
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
+config VM_AREA_GUARD
+	bool "VM area guard"
+	default n
+	help
+	  Create a guard area between VM areas so that access beyond
+	  limit can be detected.
+
 endmenu
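The user-visible effect of CONFIG_VM_AREA_GUARD=y would be that
back-to-back anonymous mappings no longer merge and end up separated by an
unmapped page. A quick, illustrative way to look for this (not part of the
patch; placement is ultimately up to the kernel, so treat the output as a
hint rather than a pass/fail test) is to compare the distance between two
consecutive anonymous mmap()s and inspect /proc/self/maps on kernels with
and without the option:

/*
 * Illustrative only: probe for a one-page guard between two anonymous
 * mappings with identical protections.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	const size_t len = 64 * 1024;
	unsigned long a = (unsigned long)mmap(NULL, len, PROT_READ | PROT_WRITE,
					      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long b = (unsigned long)mmap(NULL, len, PROT_READ | PROT_WRITE,
					      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (a == (unsigned long)MAP_FAILED || b == (unsigned long)MAP_FAILED)
		return 1;

	/* 0 bytes between the regions means they could merge into one VMA;
	 * a 4096-byte hole would indicate a guard page between them. */
	printf("a = %#lx, b = %#lx, hole = %ld bytes\n", a, b,
	       (long)((a > b ? a - b : b - a) - len));

	/* Dump the VMA list so the merge (or lack of it) can be checked. */
	system("cat /proc/self/maps");
	return 0;
}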
Create a guard area between VMAs to detect memory corruption.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 include/linux/mm.h | 30 ++++++++++++++++++++----------
 mm/Kconfig         |  7 +++++++
 2 files changed, 27 insertions(+), 10 deletions(-)