[v3,02/11] mm: Hardened usercopy

Message ID	1468619065-3222-3-git-send-email-keescook@chromium.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <kernel-hardening-return-4001-patchwork-kernel-hardening=patchwork.kernel.org@lists.openwall.com> Mailing-List: contact kernel-hardening-help@lists.openwall.com; run by ezmlm Precedence: bulk Reply-To: kernel-hardening@lists.openwall.com From: Kees Cook <keescook@chromium.org> To: linux-kernel@vger.kernel.org Cc: Kees Cook <keescook@chromium.org>, Balbir Singh <bsingharora@gmail.com>, Daniel Micay <danielmicay@gmail.com>, Josh Poimboeuf <jpoimboe@redhat.com>, Rik van Riel <riel@redhat.com>, Casey Schaufler <casey@schaufler-ca.com>, PaX Team <pageexec@freemail.hu>, Brad Spengler <spender@grsecurity.net>, Russell King <linux@armlinux.org.uk>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will.deacon@arm.com>, Ard Biesheuvel <ard.biesheuvel@linaro.org>, Benjamin Herrenschmidt <benh@kernel.crashing.org>, Michael Ellerman <mpe@ellerman.id.au>, Tony Luck <tony.luck@intel.com>, Fenghua Yu <fenghua.yu@intel.com>, "David S. Miller" <davem@davemloft.net>, x86@kernel.org, Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>, David Rientjes <rientjes@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Borislav Petkov <bp@suse.de>, Mathias Krause <minipli@googlemail.com>, Jan Kara <jack@suse.cz>, Vitaly Wool <vitalywool@gmail.com>, Andrea Arcangeli <aarcange@redhat.com>, Dmitry Vyukov <dvyukov@google.com>, Laura Abbott <labbott@fedoraproject.org>, linux-arm-kernel@lists.infradead.org, linux-ia64@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, sparclinux@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org, kernel-hardening@lists.openwall.com Date: Fri, 15 Jul 2016 14:44:16 -0700 Message-Id: <1468619065-3222-3-git-send-email-keescook@chromium.org> In-Reply-To: <1468619065-3222-1-git-send-email-keescook@chromium.org> References: <1468619065-3222-1-git-send-email-keescook@chromium.org> Subject: [kernel-hardening] [PATCH v3 02/11] mm: Hardened usercopy

Kees Cook July 15, 2016, 9:44 p.m. UTC

This is the start of porting PAX_USERCOPY into the mainline kernel. This
is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
work is based on code by PaX Team and Brad Spengler, and an earlier port
from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.

This patch contains the logic for validating several conditions when
performing copy_to_user() and copy_from_user() on the kernel object
being copied to/from:
- address range doesn't wrap around
- address range isn't NULL or zero-allocated (with a non-zero copy size)
- if on the slab allocator:
  - object size must be less than or equal to copy size (when check is
    implemented in the allocator, which appear in subsequent patches)
- otherwise, object must not span page allocations
- if on the stack
  - object must not extend before/after the current process task
  - object must be contained by the current stack frame (when there is
    arch/build support for identifying stack frames)
- object must not overlap with kernel text

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/Kconfig                |   7 ++
 include/linux/slab.h        |  12 +++
 include/linux/thread_info.h |  15 +++
 mm/Makefile                 |   4 +
 mm/usercopy.c               | 234 ++++++++++++++++++++++++++++++++++++++++++++
 security/Kconfig            |  28 ++++++
 6 files changed, 300 insertions(+)
 create mode 100644 mm/usercopy.c

Laura Abbott July 19, 2016, 1:52 a.m. UTC | #1

On 07/15/2016 02:44 PM, Kees Cook wrote:
> This is the start of porting PAX_USERCOPY into the mainline kernel. This
> is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
> work is based on code by PaX Team and Brad Spengler, and an earlier port
> from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
>
> This patch contains the logic for validating several conditions when
> performing copy_to_user() and copy_from_user() on the kernel object
> being copied to/from:
> - address range doesn't wrap around
> - address range isn't NULL or zero-allocated (with a non-zero copy size)
> - if on the slab allocator:
>   - object size must be less than or equal to copy size (when check is
>     implemented in the allocator, which appear in subsequent patches)
> - otherwise, object must not span page allocations
> - if on the stack
>   - object must not extend before/after the current process task
>   - object must be contained by the current stack frame (when there is
>     arch/build support for identifying stack frames)
> - object must not overlap with kernel text
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
> Tested-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/Kconfig                |   7 ++
>  include/linux/slab.h        |  12 +++
>  include/linux/thread_info.h |  15 +++
>  mm/Makefile                 |   4 +
>  mm/usercopy.c               | 234 ++++++++++++++++++++++++++++++++++++++++++++
>  security/Kconfig            |  28 ++++++
>  6 files changed, 300 insertions(+)
>  create mode 100644 mm/usercopy.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 5e2776562035..195ee4cc939a 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -433,6 +433,13 @@ config HAVE_ARCH_WITHIN_STACK_FRAMES
>  	  and similar) by implementing an inline arch_within_stack_frames(),
>  	  which is used by CONFIG_HARDENED_USERCOPY.
>
> +config HAVE_ARCH_LINEAR_KERNEL_MAPPING
> +	bool
> +	help
> +	  An architecture should select this if it has a secondary linear
> +	  mapping of the kernel text. This is used to verify that kernel
> +	  text exposures are not visible under CONFIG_HARDENED_USERCOPY.
> +
>  config HAVE_CONTEXT_TRACKING
>  	bool
>  	help
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index aeb3e6d00a66..96a16a3fb7cb 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -155,6 +155,18 @@ void kfree(const void *);
>  void kzfree(const void *);
>  size_t ksize(const void *);
>
> +#ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
> +const char *__check_heap_object(const void *ptr, unsigned long n,
> +				struct page *page);
> +#else
> +static inline const char *__check_heap_object(const void *ptr,
> +					      unsigned long n,
> +					      struct page *page)
> +{
> +	return NULL;
> +}
> +#endif
> +
>  /*
>   * Some archs want to perform DMA into kmalloc caches and need a guaranteed
>   * alignment larger than the alignment of a 64-bit integer.
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 3d5c80b4391d..f24b99eac969 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -155,6 +155,21 @@ static inline int arch_within_stack_frames(const void * const stack,
>  }
>  #endif
>
> +#ifdef CONFIG_HARDENED_USERCOPY
> +extern void __check_object_size(const void *ptr, unsigned long n,
> +					bool to_user);
> +
> +static inline void check_object_size(const void *ptr, unsigned long n,
> +				     bool to_user)
> +{
> +	__check_object_size(ptr, n, to_user);
> +}
> +#else
> +static inline void check_object_size(const void *ptr, unsigned long n,
> +				     bool to_user)
> +{ }
> +#endif /* CONFIG_HARDENED_USERCOPY */
> +
>  #endif	/* __KERNEL__ */
>
>  #endif /* _LINUX_THREAD_INFO_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 78c6f7dedb83..32d37247c7e5 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -21,6 +21,9 @@ KCOV_INSTRUMENT_memcontrol.o := n
>  KCOV_INSTRUMENT_mmzone.o := n
>  KCOV_INSTRUMENT_vmstat.o := n
>
> +# Since __builtin_frame_address does work as used, disable the warning.
> +CFLAGS_usercopy.o += $(call cc-disable-warning, frame-address)
> +
>  mmu-y			:= nommu.o
>  mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
>  			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> @@ -99,3 +102,4 @@ obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
> +obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
> diff --git a/mm/usercopy.c b/mm/usercopy.c
> new file mode 100644
> index 000000000000..e4bf4e7ccdf6
> --- /dev/null
> +++ b/mm/usercopy.c
> @@ -0,0 +1,234 @@
> +/*
> + * This implements the various checks for CONFIG_HARDENED_USERCOPY*,
> + * which are designed to protect kernel memory from needless exposure
> + * and overwrite under many unintended conditions. This code is based
> + * on PAX_USERCOPY, which is:
> + *
> + * Copyright (C) 2001-2016 PaX Team, Bradley Spengler, Open Source
> + * Security Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <asm/sections.h>
> +
> +enum {
> +	BAD_STACK = -1,
> +	NOT_STACK = 0,
> +	GOOD_FRAME,
> +	GOOD_STACK,
> +};
> +
> +/*
> + * Checks if a given pointer and length is contained by the current
> + * stack frame (if possible).
> + *
> + *	0: not at all on the stack
> + *	1: fully within a valid stack frame
> + *	2: fully on the stack (when can't do frame-checking)
> + *	-1: error condition (invalid stack position or bad stack frame)
> + */
> +static noinline int check_stack_object(const void *obj, unsigned long len)
> +{
> +	const void * const stack = task_stack_page(current);
> +	const void * const stackend = stack + THREAD_SIZE;
> +	int ret;
> +
> +	/* Object is not on the stack at all. */
> +	if (obj + len <= stack || stackend <= obj)
> +		return NOT_STACK;
> +
> +	/*
> +	 * Reject: object partially overlaps the stack (passing the
> +	 * the check above means at least one end is within the stack,
> +	 * so if this check fails, the other end is outside the stack).
> +	 */
> +	if (obj < stack || stackend < obj + len)
> +		return BAD_STACK;
> +
> +	/* Check if object is safely within a valid frame. */
> +	ret = arch_within_stack_frames(stack, stackend, obj, len);
> +	if (ret)
> +		return ret;
> +
> +	return GOOD_STACK;
> +}
> +
> +static void report_usercopy(const void *ptr, unsigned long len,
> +			    bool to_user, const char *type)
> +{
> +	pr_emerg("kernel memory %s attempt detected %s %p (%s) (%lu bytes)\n",
> +		to_user ? "exposure" : "overwrite",
> +		to_user ? "from" : "to", ptr, type ? : "unknown", len);
> +	/*
> +	 * For greater effect, it would be nice to do do_group_exit(),
> +	 * but BUG() actually hooks all the lock-breaking and per-arch
> +	 * Oops code, so that is used here instead.
> +	 */
> +	BUG();
> +}
> +
> +/* Returns true if any portion of [ptr,ptr+n) over laps with [low,high). */
> +static bool overlaps(const void *ptr, unsigned long n, unsigned long low,
> +		     unsigned long high)
> +{
> +	unsigned long check_low = (uintptr_t)ptr;
> +	unsigned long check_high = check_low + n;
> +
> +	/* Does not overlap if entirely above or entirely below. */
> +	if (check_low >= high || check_high < low)
> +		return false;
> +
> +	return true;
> +}
> +
> +/* Is this address range in the kernel text area? */
> +static inline const char *check_kernel_text_object(const void *ptr,
> +						   unsigned long n)
> +{
> +	unsigned long textlow = (unsigned long)_stext;
> +	unsigned long texthigh = (unsigned long)_etext;
> +
> +	if (overlaps(ptr, n, textlow, texthigh))
> +		return "<kernel text>";
> +
> +#ifdef HAVE_ARCH_LINEAR_KERNEL_MAPPING
> +	/* Check against linear mapping as well. */
> +	if (overlaps(ptr, n, (unsigned long)__va(__pa(textlow)),
> +		     (unsigned long)__va(__pa(texthigh))))
> +		return "<linear kernel text>";
> +#endif
> +
> +	return NULL;
> +}
> +
> +static inline const char *check_bogus_address(const void *ptr, unsigned long n)
> +{
> +	/* Reject if object wraps past end of memory. */
> +	if (ptr + n < ptr)
> +		return "<wrapped address>";
> +
> +	/* Reject if NULL or ZERO-allocation. */
> +	if (ZERO_OR_NULL_PTR(ptr))
> +		return "<null>";
> +
> +	return NULL;
> +}
> +
> +static inline const char *check_heap_object(const void *ptr, unsigned long n,
> +					    bool to_user)
> +{
> +	struct page *page, *endpage;
> +	const void *end = ptr + n - 1;
> +
> +	if (!virt_addr_valid(ptr))
> +		return NULL;
> +

virt_addr_valid returns true on vmalloc addresses on arm64 which causes some
intermittent false positives (tab completion in a qemu buildroot environment
was showing it fairly reliably). I think this is an arm64 bug because
virt_addr_valid should return true if and only if virt_to_page returns the
corresponding page. We can work around this for now by explicitly
checking against is_vmalloc_addr.

Thanks,
Laura

> +	page = virt_to_head_page(ptr);
> +
> +	/* Check slab allocator for flags and size. */
> +	if (PageSlab(page))
> +		return __check_heap_object(ptr, n, page);
> +
> +	/*
> +	 * Sometimes the kernel data regions are not marked Reserved (see
> +	 * check below). And sometimes [_sdata,_edata) does not cover
> +	 * rodata and/or bss, so check each range explicitly.
> +	 */
> +
> +	/* Allow reads of kernel rodata region (if not marked as Reserved). */
> +	if (ptr >= (const void *)__start_rodata &&
> +	    end <= (const void *)__end_rodata) {
> +		if (!to_user)
> +			return "<rodata>";
> +		return NULL;
> +	}
> +
> +	/* Allow kernel data region (if not marked as Reserved). */
> +	if (ptr >= (const void *)_sdata && end <= (const void *)_edata)
> +		return NULL;
> +
> +	/* Allow kernel bss region (if not marked as Reserved). */
> +	if (ptr >= (const void *)__bss_start &&
> +	    end <= (const void *)__bss_stop)
> +		return NULL;
> +
> +	/* Is the object wholly within one base page? */
> +	if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) ==
> +		   ((unsigned long)end & (unsigned long)PAGE_MASK)))
> +		return NULL;
> +
> +	/* Allow if start and end are inside the same compound page. */
> +	endpage = virt_to_head_page(end);
> +	if (likely(endpage == page))
> +		return NULL;
> +
> +	/*
> +	 * Reject if range is not Reserved (i.e. special or device memory),
> +	 * since then the object spans several independently allocated pages.
> +	 */
> +	for (; ptr <= end ; ptr += PAGE_SIZE, page = virt_to_head_page(ptr)) {
> +		if (!PageReserved(page))
> +			return "<spans multiple pages>";
> +	}
> +
> +	return NULL;
> +}
> +
> +/*
> + * Validates that the given object is one of:
> + * - known safe heap object
> + * - known safe stack object
> + * - not in kernel text
> + */
> +void __check_object_size(const void *ptr, unsigned long n, bool to_user)
> +{
> +	const char *err;
> +
> +	/* Skip all tests if size is zero. */
> +	if (!n)
> +		return;
> +
> +	/* Check for invalid addresses. */
> +	err = check_bogus_address(ptr, n);
> +	if (err)
> +		goto report;
> +
> +	/* Check for bad heap object. */
> +	err = check_heap_object(ptr, n, to_user);
> +	if (err)
> +		goto report;
> +
> +	/* Check for bad stack object. */
> +	switch (check_stack_object(ptr, n)) {
> +	case NOT_STACK:
> +		/* Object is not touching the current process stack. */
> +		break;
> +	case GOOD_FRAME:
> +	case GOOD_STACK:
> +		/*
> +		 * Object is either in the correct frame (when it
> +		 * is possible to check) or just generally on the
> +		 * process stack (when frame checking not available).
> +		 */
> +		return;
> +	default:
> +		err = "<process stack>";
> +		goto report;
> +	}
> +
> +	/* Check for object in kernel to avoid text exposure. */
> +	err = check_kernel_text_object(ptr, n);
> +	if (!err)
> +		return;
> +
> +report:
> +	report_usercopy(ptr, n, to_user, err);
> +}
> +EXPORT_SYMBOL(__check_object_size);
> diff --git a/security/Kconfig b/security/Kconfig
> index 176758cdfa57..df28f2b6f3e1 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -118,6 +118,34 @@ config LSM_MMAP_MIN_ADDR
>  	  this low address space will need the permission specific to the
>  	  systems running LSM.
>
> +config HAVE_HARDENED_USERCOPY_ALLOCATOR
> +	bool
> +	help
> +	  The heap allocator implements __check_heap_object() for
> +	  validating memory ranges against heap object sizes in
> +	  support of CONFIG_HARDENED_USERCOPY.
> +
> +config HAVE_ARCH_HARDENED_USERCOPY
> +	bool
> +	help
> +	  The architecture supports CONFIG_HARDENED_USERCOPY by
> +	  calling check_object_size() just before performing the
> +	  userspace copies in the low level implementation of
> +	  copy_to_user() and copy_from_user().
> +
> +config HARDENED_USERCOPY
> +	bool "Harden memory copies between kernel and userspace"
> +	depends on HAVE_ARCH_HARDENED_USERCOPY
> +	select BUG
> +	help
> +	  This option checks for obviously wrong memory regions when
> +	  copying memory to/from the kernel (via copy_to_user() and
> +	  copy_from_user() functions) by rejecting memory ranges that
> +	  are larger than the specified heap object, span multiple
> +	  separately allocates pages, are not on the process stack,
> +	  or are part of the kernel text. This kills entire classes
> +	  of heap overflow exploits and similar kernel memory exposures.
> +
>  source security/selinux/Kconfig
>  source security/smack/Kconfig
>  source security/tomoyo/Kconfig
>

Christian Borntraeger July 19, 2016, 9:21 a.m. UTC | #2

On 07/15/2016 11:44 PM, Kees Cook wrote:
> +config HAVE_ARCH_LINEAR_KERNEL_MAPPING
> +	bool
> +	help
> +	  An architecture should select this if it has a secondary linear
> +	  mapping of the kernel text. This is used to verify that kernel
> +	  text exposures are not visible under CONFIG_HARDENED_USERCOPY.

I have trouble parsing this. (What does secondary linear mapping mean?)
So let me give an example below

> +
[...]
> +/* Is this address range in the kernel text area? */
> +static inline const char *check_kernel_text_object(const void *ptr,
> +						   unsigned long n)
> +{
> +	unsigned long textlow = (unsigned long)_stext;
> +	unsigned long texthigh = (unsigned long)_etext;
> +
> +	if (overlaps(ptr, n, textlow, texthigh))
> +		return "<kernel text>";
> +
> +#ifdef HAVE_ARCH_LINEAR_KERNEL_MAPPING
> +	/* Check against linear mapping as well. */
> +	if (overlaps(ptr, n, (unsigned long)__va(__pa(textlow)),
> +		     (unsigned long)__va(__pa(texthigh))))
> +		return "<linear kernel text>";
> +#endif
> +
> +	return NULL;
> +}

s390 has an address space for user (primary address space from 0..4TB/8PB) and a separate 
address space (home space from 0..4TB/8PB) for the kernel. In this home space the kernel
mapping is virtual containing the physical memory as well as vmalloc memory (creating aliases
into the physical one). The kernel text is mapped from _stext to _etext in this mapping.
So I assume this would qualify for HAVE_ARCH_LINEAR_KERNEL_MAPPING ?

Kees Cook July 19, 2016, 7:12 p.m. UTC | #3

On Mon, Jul 18, 2016 at 6:52 PM, Laura Abbott <labbott@redhat.com> wrote:
> On 07/15/2016 02:44 PM, Kees Cook wrote:
>>
>> This is the start of porting PAX_USERCOPY into the mainline kernel. This
>> is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
>> work is based on code by PaX Team and Brad Spengler, and an earlier port
>> from Casey Schaufler. Additional non-slab page tests are from Rik van
>> Riel.
>>
>> This patch contains the logic for validating several conditions when
>> performing copy_to_user() and copy_from_user() on the kernel object
>> being copied to/from:
>> - address range doesn't wrap around
>> - address range isn't NULL or zero-allocated (with a non-zero copy size)
>> - if on the slab allocator:
>>   - object size must be less than or equal to copy size (when check is
>>     implemented in the allocator, which appear in subsequent patches)
>> - otherwise, object must not span page allocations
>> - if on the stack
>>   - object must not extend before/after the current process task
>>   - object must be contained by the current stack frame (when there is
>>     arch/build support for identifying stack frames)
>> - object must not overlap with kernel text
>>
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> Tested-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
>> Tested-by: Michael Ellerman <mpe@ellerman.id.au>
>> ---
>>  arch/Kconfig                |   7 ++
>>  include/linux/slab.h        |  12 +++
>>  include/linux/thread_info.h |  15 +++
>>  mm/Makefile                 |   4 +
>>  mm/usercopy.c               | 234
>> ++++++++++++++++++++++++++++++++++++++++++++
>>  security/Kconfig            |  28 ++++++
>>  6 files changed, 300 insertions(+)
>>  create mode 100644 mm/usercopy.c
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 5e2776562035..195ee4cc939a 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -433,6 +433,13 @@ config HAVE_ARCH_WITHIN_STACK_FRAMES
>>           and similar) by implementing an inline
>> arch_within_stack_frames(),
>>           which is used by CONFIG_HARDENED_USERCOPY.
>>
>> +config HAVE_ARCH_LINEAR_KERNEL_MAPPING
>> +       bool
>> +       help
>> +         An architecture should select this if it has a secondary linear
>> +         mapping of the kernel text. This is used to verify that kernel
>> +         text exposures are not visible under CONFIG_HARDENED_USERCOPY.
>> +
>>  config HAVE_CONTEXT_TRACKING
>>         bool
>>         help
>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>> index aeb3e6d00a66..96a16a3fb7cb 100644
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -155,6 +155,18 @@ void kfree(const void *);
>>  void kzfree(const void *);
>>  size_t ksize(const void *);
>>
>> +#ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
>> +const char *__check_heap_object(const void *ptr, unsigned long n,
>> +                               struct page *page);
>> +#else
>> +static inline const char *__check_heap_object(const void *ptr,
>> +                                             unsigned long n,
>> +                                             struct page *page)
>> +{
>> +       return NULL;
>> +}
>> +#endif
>> +
>>  /*
>>   * Some archs want to perform DMA into kmalloc caches and need a
>> guaranteed
>>   * alignment larger than the alignment of a 64-bit integer.
>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>> index 3d5c80b4391d..f24b99eac969 100644
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -155,6 +155,21 @@ static inline int arch_within_stack_frames(const void
>> * const stack,
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HARDENED_USERCOPY
>> +extern void __check_object_size(const void *ptr, unsigned long n,
>> +                                       bool to_user);
>> +
>> +static inline void check_object_size(const void *ptr, unsigned long n,
>> +                                    bool to_user)
>> +{
>> +       __check_object_size(ptr, n, to_user);
>> +}
>> +#else
>> +static inline void check_object_size(const void *ptr, unsigned long n,
>> +                                    bool to_user)
>> +{ }
>> +#endif /* CONFIG_HARDENED_USERCOPY */
>> +
>>  #endif /* __KERNEL__ */
>>
>>  #endif /* _LINUX_THREAD_INFO_H */
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 78c6f7dedb83..32d37247c7e5 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -21,6 +21,9 @@ KCOV_INSTRUMENT_memcontrol.o := n
>>  KCOV_INSTRUMENT_mmzone.o := n
>>  KCOV_INSTRUMENT_vmstat.o := n
>>
>> +# Since __builtin_frame_address does work as used, disable the warning.
>> +CFLAGS_usercopy.o += $(call cc-disable-warning, frame-address)
>> +
>>  mmu-y                  := nommu.o
>>  mmu-$(CONFIG_MMU)      := gup.o highmem.o memory.o mincore.o \
>>                            mlock.o mmap.o mprotect.o mremap.o msync.o
>> rmap.o \
>> @@ -99,3 +102,4 @@ obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>>  obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
>>  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
>>  obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
>> +obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>> diff --git a/mm/usercopy.c b/mm/usercopy.c
>> new file mode 100644
>> index 000000000000..e4bf4e7ccdf6
>> --- /dev/null
>> +++ b/mm/usercopy.c
>> @@ -0,0 +1,234 @@
>> +/*
>> + * This implements the various checks for CONFIG_HARDENED_USERCOPY*,
>> + * which are designed to protect kernel memory from needless exposure
>> + * and overwrite under many unintended conditions. This code is based
>> + * on PAX_USERCOPY, which is:
>> + *
>> + * Copyright (C) 2001-2016 PaX Team, Bradley Spengler, Open Source
>> + * Security Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + */
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <asm/sections.h>
>> +
>> +enum {
>> +       BAD_STACK = -1,
>> +       NOT_STACK = 0,
>> +       GOOD_FRAME,
>> +       GOOD_STACK,
>> +};
>> +
>> +/*
>> + * Checks if a given pointer and length is contained by the current
>> + * stack frame (if possible).
>> + *
>> + *     0: not at all on the stack
>> + *     1: fully within a valid stack frame
>> + *     2: fully on the stack (when can't do frame-checking)
>> + *     -1: error condition (invalid stack position or bad stack frame)
>> + */
>> +static noinline int check_stack_object(const void *obj, unsigned long
>> len)
>> +{
>> +       const void * const stack = task_stack_page(current);
>> +       const void * const stackend = stack + THREAD_SIZE;
>> +       int ret;
>> +
>> +       /* Object is not on the stack at all. */
>> +       if (obj + len <= stack || stackend <= obj)
>> +               return NOT_STACK;
>> +
>> +       /*
>> +        * Reject: object partially overlaps the stack (passing the
>> +        * the check above means at least one end is within the stack,
>> +        * so if this check fails, the other end is outside the stack).
>> +        */
>> +       if (obj < stack || stackend < obj + len)
>> +               return BAD_STACK;
>> +
>> +       /* Check if object is safely within a valid frame. */
>> +       ret = arch_within_stack_frames(stack, stackend, obj, len);
>> +       if (ret)
>> +               return ret;
>> +
>> +       return GOOD_STACK;
>> +}
>> +
>> +static void report_usercopy(const void *ptr, unsigned long len,
>> +                           bool to_user, const char *type)
>> +{
>> +       pr_emerg("kernel memory %s attempt detected %s %p (%s) (%lu
>> bytes)\n",
>> +               to_user ? "exposure" : "overwrite",
>> +               to_user ? "from" : "to", ptr, type ? : "unknown", len);
>> +       /*
>> +        * For greater effect, it would be nice to do do_group_exit(),
>> +        * but BUG() actually hooks all the lock-breaking and per-arch
>> +        * Oops code, so that is used here instead.
>> +        */
>> +       BUG();
>> +}
>> +
>> +/* Returns true if any portion of [ptr,ptr+n) over laps with [low,high).
>> */
>> +static bool overlaps(const void *ptr, unsigned long n, unsigned long low,
>> +                    unsigned long high)
>> +{
>> +       unsigned long check_low = (uintptr_t)ptr;
>> +       unsigned long check_high = check_low + n;
>> +
>> +       /* Does not overlap if entirely above or entirely below. */
>> +       if (check_low >= high || check_high < low)
>> +               return false;
>> +
>> +       return true;
>> +}
>> +
>> +/* Is this address range in the kernel text area? */
>> +static inline const char *check_kernel_text_object(const void *ptr,
>> +                                                  unsigned long n)
>> +{
>> +       unsigned long textlow = (unsigned long)_stext;
>> +       unsigned long texthigh = (unsigned long)_etext;
>> +
>> +       if (overlaps(ptr, n, textlow, texthigh))
>> +               return "<kernel text>";
>> +
>> +#ifdef HAVE_ARCH_LINEAR_KERNEL_MAPPING
>> +       /* Check against linear mapping as well. */
>> +       if (overlaps(ptr, n, (unsigned long)__va(__pa(textlow)),
>> +                    (unsigned long)__va(__pa(texthigh))))
>> +               return "<linear kernel text>";
>> +#endif
>> +
>> +       return NULL;
>> +}
>> +
>> +static inline const char *check_bogus_address(const void *ptr, unsigned
>> long n)
>> +{
>> +       /* Reject if object wraps past end of memory. */
>> +       if (ptr + n < ptr)
>> +               return "<wrapped address>";
>> +
>> +       /* Reject if NULL or ZERO-allocation. */
>> +       if (ZERO_OR_NULL_PTR(ptr))
>> +               return "<null>";
>> +
>> +       return NULL;
>> +}
>> +
>> +static inline const char *check_heap_object(const void *ptr, unsigned
>> long n,
>> +                                           bool to_user)
>> +{
>> +       struct page *page, *endpage;
>> +       const void *end = ptr + n - 1;
>> +
>> +       if (!virt_addr_valid(ptr))
>> +               return NULL;
>> +
>
>
> virt_addr_valid returns true on vmalloc addresses on arm64 which causes some
> intermittent false positives (tab completion in a qemu buildroot environment
> was showing it fairly reliably). I think this is an arm64 bug because
> virt_addr_valid should return true if and only if virt_to_page returns the
> corresponding page. We can work around this for now by explicitly
> checking against is_vmalloc_addr.

Hrm, that's weird. Sounds like a bug too, but I'll add a check for
is_vmalloc_addr() to catch it for now.

-Kees

>
> Thanks,
> Laura
>
>
>> +       page = virt_to_head_page(ptr);
>> +
>> +       /* Check slab allocator for flags and size. */
>> +       if (PageSlab(page))
>> +               return __check_heap_object(ptr, n, page);
>> +
>> +       /*
>> +        * Sometimes the kernel data regions are not marked Reserved (see
>> +        * check below). And sometimes [_sdata,_edata) does not cover
>> +        * rodata and/or bss, so check each range explicitly.
>> +        */
>> +
>> +       /* Allow reads of kernel rodata region (if not marked as
>> Reserved). */
>> +       if (ptr >= (const void *)__start_rodata &&
>> +           end <= (const void *)__end_rodata) {
>> +               if (!to_user)
>> +                       return "<rodata>";
>> +               return NULL;
>> +       }
>> +
>> +       /* Allow kernel data region (if not marked as Reserved). */
>> +       if (ptr >= (const void *)_sdata && end <= (const void *)_edata)
>> +               return NULL;
>> +
>> +       /* Allow kernel bss region (if not marked as Reserved). */
>> +       if (ptr >= (const void *)__bss_start &&
>> +           end <= (const void *)__bss_stop)
>> +               return NULL;
>> +
>> +       /* Is the object wholly within one base page? */
>> +       if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) ==
>> +                  ((unsigned long)end & (unsigned long)PAGE_MASK)))
>> +               return NULL;
>> +
>> +       /* Allow if start and end are inside the same compound page. */
>> +       endpage = virt_to_head_page(end);
>> +       if (likely(endpage == page))
>> +               return NULL;
>> +
>> +       /*
>> +        * Reject if range is not Reserved (i.e. special or device
>> memory),
>> +        * since then the object spans several independently allocated
>> pages.
>> +        */
>> +       for (; ptr <= end ; ptr += PAGE_SIZE, page =
>> virt_to_head_page(ptr)) {
>> +               if (!PageReserved(page))
>> +                       return "<spans multiple pages>";
>> +       }
>> +
>> +       return NULL;
>> +}
>> +
>> +/*
>> + * Validates that the given object is one of:
>> + * - known safe heap object
>> + * - known safe stack object
>> + * - not in kernel text
>> + */
>> +void __check_object_size(const void *ptr, unsigned long n, bool to_user)
>> +{
>> +       const char *err;
>> +
>> +       /* Skip all tests if size is zero. */
>> +       if (!n)
>> +               return;
>> +
>> +       /* Check for invalid addresses. */
>> +       err = check_bogus_address(ptr, n);
>> +       if (err)
>> +               goto report;
>> +
>> +       /* Check for bad heap object. */
>> +       err = check_heap_object(ptr, n, to_user);
>> +       if (err)
>> +               goto report;
>> +
>> +       /* Check for bad stack object. */
>> +       switch (check_stack_object(ptr, n)) {
>> +       case NOT_STACK:
>> +               /* Object is not touching the current process stack. */
>> +               break;
>> +       case GOOD_FRAME:
>> +       case GOOD_STACK:
>> +               /*
>> +                * Object is either in the correct frame (when it
>> +                * is possible to check) or just generally on the
>> +                * process stack (when frame checking not available).
>> +                */
>> +               return;
>> +       default:
>> +               err = "<process stack>";
>> +               goto report;
>> +       }
>> +
>> +       /* Check for object in kernel to avoid text exposure. */
>> +       err = check_kernel_text_object(ptr, n);
>> +       if (!err)
>> +               return;
>> +
>> +report:
>> +       report_usercopy(ptr, n, to_user, err);
>> +}
>> +EXPORT_SYMBOL(__check_object_size);
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 176758cdfa57..df28f2b6f3e1 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -118,6 +118,34 @@ config LSM_MMAP_MIN_ADDR
>>           this low address space will need the permission specific to the
>>           systems running LSM.
>>
>> +config HAVE_HARDENED_USERCOPY_ALLOCATOR
>> +       bool
>> +       help
>> +         The heap allocator implements __check_heap_object() for
>> +         validating memory ranges against heap object sizes in
>> +         support of CONFIG_HARDENED_USERCOPY.
>> +
>> +config HAVE_ARCH_HARDENED_USERCOPY
>> +       bool
>> +       help
>> +         The architecture supports CONFIG_HARDENED_USERCOPY by
>> +         calling check_object_size() just before performing the
>> +         userspace copies in the low level implementation of
>> +         copy_to_user() and copy_from_user().
>> +
>> +config HARDENED_USERCOPY
>> +       bool "Harden memory copies between kernel and userspace"
>> +       depends on HAVE_ARCH_HARDENED_USERCOPY
>> +       select BUG
>> +       help
>> +         This option checks for obviously wrong memory regions when
>> +         copying memory to/from the kernel (via copy_to_user() and
>> +         copy_from_user() functions) by rejecting memory ranges that
>> +         are larger than the specified heap object, span multiple
>> +         separately allocates pages, are not on the process stack,
>> +         or are part of the kernel text. This kills entire classes
>> +         of heap overflow exploits and similar kernel memory exposures.
>> +
>>  source security/selinux/Kconfig
>>  source security/smack/Kconfig
>>  source security/tomoyo/Kconfig
>>
>

Kees Cook July 19, 2016, 7:31 p.m. UTC | #4

On Tue, Jul 19, 2016 at 2:21 AM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 07/15/2016 11:44 PM, Kees Cook wrote:
>> +config HAVE_ARCH_LINEAR_KERNEL_MAPPING
>> +     bool
>> +     help
>> +       An architecture should select this if it has a secondary linear
>> +       mapping of the kernel text. This is used to verify that kernel
>> +       text exposures are not visible under CONFIG_HARDENED_USERCOPY.
>
> I have trouble parsing this. (What does secondary linear mapping mean?)

I likely need help clarifying this language...

> So let me give an example below
>
>> +
> [...]
>> +/* Is this address range in the kernel text area? */
>> +static inline const char *check_kernel_text_object(const void *ptr,
>> +                                                unsigned long n)
>> +{
>> +     unsigned long textlow = (unsigned long)_stext;
>> +     unsigned long texthigh = (unsigned long)_etext;
>> +
>> +     if (overlaps(ptr, n, textlow, texthigh))
>> +             return "<kernel text>";
>> +
>> +#ifdef HAVE_ARCH_LINEAR_KERNEL_MAPPING
>> +     /* Check against linear mapping as well. */
>> +     if (overlaps(ptr, n, (unsigned long)__va(__pa(textlow)),
>> +                  (unsigned long)__va(__pa(texthigh))))
>> +             return "<linear kernel text>";
>> +#endif
>> +
>> +     return NULL;
>> +}
>
> s390 has an address space for user (primary address space from 0..4TB/8PB) and a separate
> address space (home space from 0..4TB/8PB) for the kernel. In this home space the kernel
> mapping is virtual containing the physical memory as well as vmalloc memory (creating aliases
> into the physical one). The kernel text is mapped from _stext to _etext in this mapping.
> So I assume this would qualify for HAVE_ARCH_LINEAR_KERNEL_MAPPING ?

If I understand your example, yes. In the home space you have two
addresses that reference the kernel image? The intent is that if
__va(__pa(_stext)) != _stext, there's a linear mapping of physical
memory in the virtual memory range. On x86_64, the kernel is visible
in two locations in virtual memory. The kernel start in physical
memory address 0x01000000 maps to virtual address 0xffff880001000000,
and the "regular" virtual memory kernel address is at
0xffffffff81000000:

# grep Kernel /proc/iomem
  01000000-01a59767 : Kernel code
  01a59768-0213d77f : Kernel data
  02280000-02fdefff : Kernel bss

# grep startup_64 /proc/kallsyms
ffffffff81000000 T startup_64

# less /sys/kernel/debug/kernel_page_tables
...
---[ Low Kernel Mapping ]---
...
0xffff880001000000-0xffff880001a00000          10M     ro         PSE
 GLB NX pmd
0xffff880001a00000-0xffff880001a5c000         368K     ro   GLB NX pte
0xffff880001a5c000-0xffff880001c00000        1680K     RW   GLB NX pte
...
---[ High Kernel Mapping ]---
...
0xffffffff81000000-0xffffffff81a00000          10M     ro         PSE
 GLB x  pmd
0xffffffff81a00000-0xffffffff81a5c000         368K     ro   GLB x  pte
0xffffffff81a5c000-0xffffffff81c00000        1680K     RW   GLB NX pte
...

I wonder if I can avoid the CONFIG entirely if I just did a
__va(__pa(_stext)) != _stext test... would that break anyone?

-Kees

Christian Borntraeger July 19, 2016, 8:14 p.m. UTC | #5

On 07/19/2016 09:31 PM, Kees Cook wrote:
> On Tue, Jul 19, 2016 at 2:21 AM, Christian Borntraeger
> <borntraeger@de.ibm.com> wrote:
>> On 07/15/2016 11:44 PM, Kees Cook wrote:
>>> +config HAVE_ARCH_LINEAR_KERNEL_MAPPING
>>> +     bool
>>> +     help
>>> +       An architecture should select this if it has a secondary linear
>>> +       mapping of the kernel text. This is used to verify that kernel
>>> +       text exposures are not visible under CONFIG_HARDENED_USERCOPY.
>>
>> I have trouble parsing this. (What does secondary linear mapping mean?)
> 
> I likely need help clarifying this language...
> 
>> So let me give an example below
>>
>>> +
>> [...]
>>> +/* Is this address range in the kernel text area? */
>>> +static inline const char *check_kernel_text_object(const void *ptr,
>>> +                                                unsigned long n)
>>> +{
>>> +     unsigned long textlow = (unsigned long)_stext;
>>> +     unsigned long texthigh = (unsigned long)_etext;
>>> +
>>> +     if (overlaps(ptr, n, textlow, texthigh))
>>> +             return "<kernel text>";
>>> +
>>> +#ifdef HAVE_ARCH_LINEAR_KERNEL_MAPPING
>>> +     /* Check against linear mapping as well. */
>>> +     if (overlaps(ptr, n, (unsigned long)__va(__pa(textlow)),
>>> +                  (unsigned long)__va(__pa(texthigh))))
>>> +             return "<linear kernel text>";
>>> +#endif
>>> +
>>> +     return NULL;
>>> +}
>>
>> s390 has an address space for user (primary address space from 0..4TB/8PB) and a separate
>> address space (home space from 0..4TB/8PB) for the kernel. In this home space the kernel
>> mapping is virtual containing the physical memory as well as vmalloc memory (creating aliases
>> into the physical one). The kernel text is mapped from _stext to _etext in this mapping.
>> So I assume this would qualify for HAVE_ARCH_LINEAR_KERNEL_MAPPING ?
> 
> If I understand your example, yes. In the home space you have two
> addresses that reference the kernel image?

No, there is only one address that points to the kernel.
As we have no kernel ASLR yet, and the kernel mapping is 
a 1:1 mapping from 0 to memory end and the kernel is only
from _stext to _etext. The vmalloc area contains modules
and vmalloc but not a 2nd kernel mapping.

But thanks for your example, now I understood. If we have only
one address 
>>> +     if (overlaps(ptr, n, textlow, texthigh))
>>> +             return "<kernel text>";

This is just enough.

So what about for the CONFIG text:

       An architecture should select this if the kernel mapping has a secondary
       linear mapping of the kernel text - in other words more than one virtual
       kernel address that points to the kernel image. This is used to verify 
       that kernel text exposures are not visible under CONFIG_HARDENED_USERCOPY.


> I wonder if I can avoid the CONFIG entirely if I just did a
> __va(__pa(_stext)) != _stext test... would that break anyone?

Can this be resolved on all platforms at compile time?

Kees Cook July 19, 2016, 8:34 p.m. UTC | #6

On Tue, Jul 19, 2016 at 1:14 PM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> On 07/19/2016 09:31 PM, Kees Cook wrote:
>> On Tue, Jul 19, 2016 at 2:21 AM, Christian Borntraeger
>> <borntraeger@de.ibm.com> wrote:
>>> On 07/15/2016 11:44 PM, Kees Cook wrote:
>>>> +config HAVE_ARCH_LINEAR_KERNEL_MAPPING
>>>> +     bool
>>>> +     help
>>>> +       An architecture should select this if it has a secondary linear
>>>> +       mapping of the kernel text. This is used to verify that kernel
>>>> +       text exposures are not visible under CONFIG_HARDENED_USERCOPY.
>>>
>>> I have trouble parsing this. (What does secondary linear mapping mean?)
>>
>> I likely need help clarifying this language...
>>
>>> So let me give an example below
>>>
>>>> +
>>> [...]
>>>> +/* Is this address range in the kernel text area? */
>>>> +static inline const char *check_kernel_text_object(const void *ptr,
>>>> +                                                unsigned long n)
>>>> +{
>>>> +     unsigned long textlow = (unsigned long)_stext;
>>>> +     unsigned long texthigh = (unsigned long)_etext;
>>>> +
>>>> +     if (overlaps(ptr, n, textlow, texthigh))
>>>> +             return "<kernel text>";
>>>> +
>>>> +#ifdef HAVE_ARCH_LINEAR_KERNEL_MAPPING
>>>> +     /* Check against linear mapping as well. */
>>>> +     if (overlaps(ptr, n, (unsigned long)__va(__pa(textlow)),
>>>> +                  (unsigned long)__va(__pa(texthigh))))
>>>> +             return "<linear kernel text>";
>>>> +#endif
>>>> +
>>>> +     return NULL;
>>>> +}
>>>
>>> s390 has an address space for user (primary address space from 0..4TB/8PB) and a separate
>>> address space (home space from 0..4TB/8PB) for the kernel. In this home space the kernel
>>> mapping is virtual containing the physical memory as well as vmalloc memory (creating aliases
>>> into the physical one). The kernel text is mapped from _stext to _etext in this mapping.
>>> So I assume this would qualify for HAVE_ARCH_LINEAR_KERNEL_MAPPING ?
>>
>> If I understand your example, yes. In the home space you have two
>> addresses that reference the kernel image?
>
> No, there is only one address that points to the kernel.
> As we have no kernel ASLR yet, and the kernel mapping is
> a 1:1 mapping from 0 to memory end and the kernel is only
> from _stext to _etext. The vmalloc area contains modules
> and vmalloc but not a 2nd kernel mapping.
>
> But thanks for your example, now I understood. If we have only
> one address
>>>> +     if (overlaps(ptr, n, textlow, texthigh))
>>>> +             return "<kernel text>";
>
> This is just enough.
>
> So what about for the CONFIG text:
>
>        An architecture should select this if the kernel mapping has a secondary
>        linear mapping of the kernel text - in other words more than one virtual
>        kernel address that points to the kernel image. This is used to verify
>        that kernel text exposures are not visible under CONFIG_HARDENED_USERCOPY.

Sounds good, I've adjusted it for now.

>> I wonder if I can avoid the CONFIG entirely if I just did a
>> __va(__pa(_stext)) != _stext test... would that break anyone?
>
> Can this be resolved on all platforms at compile time?

Well, I think it still needs a runtime check (compile-time may not be
able to tell about kaslr, or who knows what else). I would really like
to avoid the CONFIG if possible, though. Would this do the right thing
on s390? This appears to work where I'm able to test it (32/64 x86,
32/64 arm):

        unsigned long textlow = (unsigned long)_stext;
        unsigned long texthigh = (unsigned long)_etext;
        unsigned long textlow_linear = (unsigned long)__va(__pa(textlow);
        unsigned long texthigh_linear = (unsigned long)__va(__pa(texthigh);

        if (overlaps(ptr, n, textlow, texthigh))
                return "<kernel text>";

        /* Check against possible secondary linear mapping as well. */
        if (textlow != textlow_linear &&
            overlaps(ptr, n, textlow_linear, texthigh_linear))
                return "<linear kernel text>";

        return NULL;


-Kees

Christian Borntraeger July 19, 2016, 8:44 p.m. UTC | #7

On 07/19/2016 10:34 PM, Kees Cook wrote:
[...]
>>
>> So what about for the CONFIG text:
>>
>>        An architecture should select this if the kernel mapping has a secondary
>>        linear mapping of the kernel text - in other words more than one virtual
>>        kernel address that points to the kernel image. This is used to verify
>>        that kernel text exposures are not visible under CONFIG_HARDENED_USERCOPY.
> 
> Sounds good, I've adjusted it for now.
> 
>>> I wonder if I can avoid the CONFIG entirely if I just did a
>>> __va(__pa(_stext)) != _stext test... would that break anyone?
>>
>> Can this be resolved on all platforms at compile time?
> 
> Well, I think it still needs a runtime check (compile-time may not be
> able to tell about kaslr, or who knows what else). I would really like
> to avoid the CONFIG if possible, though. Would this do the right thing
> on s390? This appears to work where I'm able to test it (32/64 x86,
> 32/64 arm):
> 
>         unsigned long textlow = (unsigned long)_stext;
>         unsigned long texthigh = (unsigned long)_etext;
>         unsigned long textlow_linear = (unsigned long)__va(__pa(textlow);
>         unsigned long texthigh_linear = (unsigned long)__va(__pa(texthigh);
> 
as we have

#define PAGE_OFFSET             0x0UL
#define __pa(x)                 (unsigned long)(x)
#define __va(x)                 (void *)(unsigned long)(x)

both should be identical on s390 as of today, so it should work fine and only
do the check once

>         if (overlaps(ptr, n, textlow, texthigh))
>                 return "<kernel text>";
> 
>         /* Check against possible secondary linear mapping as well. */
>         if (textlow != textlow_linear &&
>             overlaps(ptr, n, textlow_linear, texthigh_linear))
>                 return "<linear kernel text>";
> 
>         return NULL;
> 
> 
> -Kees
> 


PS: Not sure how useful and flexible this offers is but you can get some temporary
free access to an s390 on https://developer.ibm.com/linuxone/

Kees Cook July 19, 2016, 10:55 p.m. UTC | #8

On Tue, Jul 19, 2016 at 12:12 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jul 18, 2016 at 6:52 PM, Laura Abbott <labbott@redhat.com> wrote:
>> On 07/15/2016 02:44 PM, Kees Cook wrote:
>>> +static inline const char *check_heap_object(const void *ptr, unsigned
>>> long n,
>>> +                                           bool to_user)
>>> +{
>>> +       struct page *page, *endpage;
>>> +       const void *end = ptr + n - 1;
>>> +
>>> +       if (!virt_addr_valid(ptr))
>>> +               return NULL;
>>> +
>>
>>
>> virt_addr_valid returns true on vmalloc addresses on arm64 which causes some
>> intermittent false positives (tab completion in a qemu buildroot environment
>> was showing it fairly reliably). I think this is an arm64 bug because
>> virt_addr_valid should return true if and only if virt_to_page returns the
>> corresponding page. We can work around this for now by explicitly
>> checking against is_vmalloc_addr.
>
> Hrm, that's weird. Sounds like a bug too, but I'll add a check for
> is_vmalloc_addr() to catch it for now.

BTW, if you were testing against -next, KASAN moved things around in
copy_*_user() in a way I wasn't expecting (__copy* and copy* now both
call __arch_copy* instead of copy* calling __copy*). I'll have this
fixed in the next version.

-Kees

Michael Ellerman July 21, 2016, 6:52 a.m. UTC | #9

Kees Cook <keescook@chromium.org> writes:

> diff --git a/mm/usercopy.c b/mm/usercopy.c
> new file mode 100644
> index 000000000000..e4bf4e7ccdf6
> --- /dev/null
> +++ b/mm/usercopy.c
> @@ -0,0 +1,234 @@
...
> +
> +/*
> + * Checks if a given pointer and length is contained by the current
> + * stack frame (if possible).
> + *
> + *	0: not at all on the stack
> + *	1: fully within a valid stack frame
> + *	2: fully on the stack (when can't do frame-checking)
> + *	-1: error condition (invalid stack position or bad stack frame)
> + */
> +static noinline int check_stack_object(const void *obj, unsigned long len)
> +{
> +	const void * const stack = task_stack_page(current);
> +	const void * const stackend = stack + THREAD_SIZE;

That allows access to the entire stack, including the struct thread_info,
is that what we want - it seems dangerous? Or did I miss a check
somewhere else?

We have end_of_stack() which computes the end of the stack taking
thread_info into account (end being the opposite of your end above).

cheers

Kees Cook July 21, 2016, 6:34 p.m. UTC | #10

On Wed, Jul 20, 2016 at 11:52 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> diff --git a/mm/usercopy.c b/mm/usercopy.c
>> new file mode 100644
>> index 000000000000..e4bf4e7ccdf6
>> --- /dev/null
>> +++ b/mm/usercopy.c
>> @@ -0,0 +1,234 @@
> ...
>> +
>> +/*
>> + * Checks if a given pointer and length is contained by the current
>> + * stack frame (if possible).
>> + *
>> + *   0: not at all on the stack
>> + *   1: fully within a valid stack frame
>> + *   2: fully on the stack (when can't do frame-checking)
>> + *   -1: error condition (invalid stack position or bad stack frame)
>> + */
>> +static noinline int check_stack_object(const void *obj, unsigned long len)
>> +{
>> +     const void * const stack = task_stack_page(current);
>> +     const void * const stackend = stack + THREAD_SIZE;
>
> That allows access to the entire stack, including the struct thread_info,
> is that what we want - it seems dangerous? Or did I miss a check
> somewhere else?

That seems like a nice improvement to make, yeah.

> We have end_of_stack() which computes the end of the stack taking
> thread_info into account (end being the opposite of your end above).

Amusingly, the object_is_on_stack() check in sched.h doesn't take
thread_info into account either. :P Regardless, I think using
end_of_stack() may not be best. To tighten the check, I think we could
add this after checking that the object is on the stack:

#ifdef CONFIG_STACK_GROWSUP
        stackend -= sizeof(struct thread_info);
#else
        stack += sizeof(struct thread_info);
#endif

e.g. then if the pointer was in the thread_info, the second test would
fail, triggering the protection.

-Kees

Josh Poimboeuf July 22, 2016, 5:45 p.m. UTC | #11

On Thu, Jul 21, 2016 at 11:34:25AM -0700, Kees Cook wrote:
> On Wed, Jul 20, 2016 at 11:52 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> > Kees Cook <keescook@chromium.org> writes:
> >
> >> diff --git a/mm/usercopy.c b/mm/usercopy.c
> >> new file mode 100644
> >> index 000000000000..e4bf4e7ccdf6
> >> --- /dev/null
> >> +++ b/mm/usercopy.c
> >> @@ -0,0 +1,234 @@
> > ...
> >> +
> >> +/*
> >> + * Checks if a given pointer and length is contained by the current
> >> + * stack frame (if possible).
> >> + *
> >> + *   0: not at all on the stack
> >> + *   1: fully within a valid stack frame
> >> + *   2: fully on the stack (when can't do frame-checking)
> >> + *   -1: error condition (invalid stack position or bad stack frame)
> >> + */
> >> +static noinline int check_stack_object(const void *obj, unsigned long len)
> >> +{
> >> +     const void * const stack = task_stack_page(current);
> >> +     const void * const stackend = stack + THREAD_SIZE;
> >
> > That allows access to the entire stack, including the struct thread_info,
> > is that what we want - it seems dangerous? Or did I miss a check
> > somewhere else?
> 
> That seems like a nice improvement to make, yeah.
> 
> > We have end_of_stack() which computes the end of the stack taking
> > thread_info into account (end being the opposite of your end above).
> 
> Amusingly, the object_is_on_stack() check in sched.h doesn't take
> thread_info into account either. :P Regardless, I think using
> end_of_stack() may not be best. To tighten the check, I think we could
> add this after checking that the object is on the stack:
> 
> #ifdef CONFIG_STACK_GROWSUP
>         stackend -= sizeof(struct thread_info);
> #else
>         stack += sizeof(struct thread_info);
> #endif
> 
> e.g. then if the pointer was in the thread_info, the second test would
> fail, triggering the protection.

FWIW, this won't work right on x86 after Andy's
CONFIG_THREAD_INFO_IN_TASK patches get merged.

David Laight July 25, 2016, 9:27 a.m. UTC | #12

From: Josh Poimboeuf

> Sent: 22 July 2016 18:46

..
> > >> +/*

> > >> + * Checks if a given pointer and length is contained by the current

> > >> + * stack frame (if possible).

> > >> + *

> > >> + *   0: not at all on the stack

> > >> + *   1: fully within a valid stack frame

> > >> + *   2: fully on the stack (when can't do frame-checking)

> > >> + *   -1: error condition (invalid stack position or bad stack frame)

> > >> + */

> > >> +static noinline int check_stack_object(const void *obj, unsigned long len)

> > >> +{

> > >> +     const void * const stack = task_stack_page(current);

> > >> +     const void * const stackend = stack + THREAD_SIZE;

> > >

> > > That allows access to the entire stack, including the struct thread_info,

> > > is that what we want - it seems dangerous? Or did I miss a check

> > > somewhere else?

> >

> > That seems like a nice improvement to make, yeah.

> >

> > > We have end_of_stack() which computes the end of the stack taking

> > > thread_info into account (end being the opposite of your end above).

> >

> > Amusingly, the object_is_on_stack() check in sched.h doesn't take

> > thread_info into account either. :P Regardless, I think using

> > end_of_stack() may not be best. To tighten the check, I think we could

> > add this after checking that the object is on the stack:

> >

> > #ifdef CONFIG_STACK_GROWSUP

> >         stackend -= sizeof(struct thread_info);

> > #else

> >         stack += sizeof(struct thread_info);

> > #endif

> >

> > e.g. then if the pointer was in the thread_info, the second test would

> > fail, triggering the protection.

> 

> FWIW, this won't work right on x86 after Andy's

> CONFIG_THREAD_INFO_IN_TASK patches get merged.


What ends up in the 'thread_info' area?
If it contains the fp save area then programs like gdb may end up requesting
copy_in/out directly from that area.

Interestingly the avx registers don't need saving on a normal system call
entry (they are all caller-saved) so the kernel stack can safely overwrite
that area.
Syscall entry probably ought to execute the 'zero all avx registers' instruction.
They do need saving on interrupt entry - but the stack used will be less.

	David

Michael Ellerman July 26, 2016, 2:03 a.m. UTC | #13

Josh Poimboeuf <jpoimboe@redhat.com> writes:

> On Thu, Jul 21, 2016 at 11:34:25AM -0700, Kees Cook wrote:
>> On Wed, Jul 20, 2016 at 11:52 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
>> > Kees Cook <keescook@chromium.org> writes:
>> >
>> >> diff --git a/mm/usercopy.c b/mm/usercopy.c
>> >> new file mode 100644
>> >> index 000000000000..e4bf4e7ccdf6
>> >> --- /dev/null
>> >> +++ b/mm/usercopy.c
>> >> @@ -0,0 +1,234 @@
>> > ...
>> >> +
>> >> +/*
>> >> + * Checks if a given pointer and length is contained by the current
>> >> + * stack frame (if possible).
>> >> + *
>> >> + *   0: not at all on the stack
>> >> + *   1: fully within a valid stack frame
>> >> + *   2: fully on the stack (when can't do frame-checking)
>> >> + *   -1: error condition (invalid stack position or bad stack frame)
>> >> + */
>> >> +static noinline int check_stack_object(const void *obj, unsigned long len)
>> >> +{
>> >> +     const void * const stack = task_stack_page(current);
>> >> +     const void * const stackend = stack + THREAD_SIZE;
>> >
>> > That allows access to the entire stack, including the struct thread_info,
>> > is that what we want - it seems dangerous? Or did I miss a check
>> > somewhere else?
>> 
>> That seems like a nice improvement to make, yeah.
>> 
>> > We have end_of_stack() which computes the end of the stack taking
>> > thread_info into account (end being the opposite of your end above).
>> 
>> Amusingly, the object_is_on_stack() check in sched.h doesn't take
>> thread_info into account either. :P Regardless, I think using
>> end_of_stack() may not be best. To tighten the check, I think we could
>> add this after checking that the object is on the stack:
>> 
>> #ifdef CONFIG_STACK_GROWSUP
>>         stackend -= sizeof(struct thread_info);
>> #else
>>         stack += sizeof(struct thread_info);
>> #endif
>> 
>> e.g. then if the pointer was in the thread_info, the second test would
>> fail, triggering the protection.
>
> FWIW, this won't work right on x86 after Andy's
> CONFIG_THREAD_INFO_IN_TASK patches get merged.

Yeah. I wonder if it's better for the arch helper to just take the obj and len,
and work out it's own bounds for the stack using current and whatever makes
sense on that arch.

It would avoid too much ifdefery in the generic code, and also avoid any
confusion about whether stackend is the high or low address.

eg. on powerpc we could do:

int noinline arch_within_stack_frames(const void *obj, unsigned long len)
{
	void *stack_low  = end_of_stack(current);
	void *stack_high = task_stack_page(current) + THREAD_SIZE;


Whereas arches with STACK_GROWSUP=y could do roughly the reverse, and x86 can do
whatever it needs to depending on whether the thread_info is on or off stack.

cheers

Michael Ellerman July 26, 2016, 2:09 a.m. UTC | #14

David Laight <David.Laight@ACULAB.COM> writes:

> From: Josh Poimboeuf
>> Sent: 22 July 2016 18:46
>> >
>> > e.g. then if the pointer was in the thread_info, the second test would
>> > fail, triggering the protection.
>> 
>> FWIW, this won't work right on x86 after Andy's
>> CONFIG_THREAD_INFO_IN_TASK patches get merged.
>
> What ends up in the 'thread_info' area?

It depends on the arch.

> If it contains the fp save area then programs like gdb may end up requesting
> copy_in/out directly from that area.

On the arches I've seen thread_info doesn't usually contain register save areas,
but if it did then it would be up to the arch helper to allow that copy to go
through.

However given thread_info generally contains lots of low level flags that would
be a good target for an attacker, the best way to cope with ptrace wanting to
copy to/from it would be to use a temporary, and prohibit copying directly
to/from thread_info - IMHO.

cheers

Kees Cook July 26, 2016, 4:46 a.m. UTC | #15

On Mon, Jul 25, 2016 at 7:03 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Josh Poimboeuf <jpoimboe@redhat.com> writes:
>
>> On Thu, Jul 21, 2016 at 11:34:25AM -0700, Kees Cook wrote:
>>> On Wed, Jul 20, 2016 at 11:52 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
>>> > Kees Cook <keescook@chromium.org> writes:
>>> >
>>> >> diff --git a/mm/usercopy.c b/mm/usercopy.c
>>> >> new file mode 100644
>>> >> index 000000000000..e4bf4e7ccdf6
>>> >> --- /dev/null
>>> >> +++ b/mm/usercopy.c
>>> >> @@ -0,0 +1,234 @@
>>> > ...
>>> >> +
>>> >> +/*
>>> >> + * Checks if a given pointer and length is contained by the current
>>> >> + * stack frame (if possible).
>>> >> + *
>>> >> + *   0: not at all on the stack
>>> >> + *   1: fully within a valid stack frame
>>> >> + *   2: fully on the stack (when can't do frame-checking)
>>> >> + *   -1: error condition (invalid stack position or bad stack frame)
>>> >> + */
>>> >> +static noinline int check_stack_object(const void *obj, unsigned long len)
>>> >> +{
>>> >> +     const void * const stack = task_stack_page(current);
>>> >> +     const void * const stackend = stack + THREAD_SIZE;
>>> >
>>> > That allows access to the entire stack, including the struct thread_info,
>>> > is that what we want - it seems dangerous? Or did I miss a check
>>> > somewhere else?
>>>
>>> That seems like a nice improvement to make, yeah.
>>>
>>> > We have end_of_stack() which computes the end of the stack taking
>>> > thread_info into account (end being the opposite of your end above).
>>>
>>> Amusingly, the object_is_on_stack() check in sched.h doesn't take
>>> thread_info into account either. :P Regardless, I think using
>>> end_of_stack() may not be best. To tighten the check, I think we could
>>> add this after checking that the object is on the stack:
>>>
>>> #ifdef CONFIG_STACK_GROWSUP
>>>         stackend -= sizeof(struct thread_info);
>>> #else
>>>         stack += sizeof(struct thread_info);
>>> #endif
>>>
>>> e.g. then if the pointer was in the thread_info, the second test would
>>> fail, triggering the protection.
>>
>> FWIW, this won't work right on x86 after Andy's
>> CONFIG_THREAD_INFO_IN_TASK patches get merged.
>
> Yeah. I wonder if it's better for the arch helper to just take the obj and len,
> and work out it's own bounds for the stack using current and whatever makes
> sense on that arch.
>
> It would avoid too much ifdefery in the generic code, and also avoid any
> confusion about whether stackend is the high or low address.
>
> eg. on powerpc we could do:
>
> int noinline arch_within_stack_frames(const void *obj, unsigned long len)
> {
>         void *stack_low  = end_of_stack(current);
>         void *stack_high = task_stack_page(current) + THREAD_SIZE;
>
>
> Whereas arches with STACK_GROWSUP=y could do roughly the reverse, and x86 can do
> whatever it needs to depending on whether the thread_info is on or off stack.
>
> cheers

Yeah, I agree: this should be in the arch code. If the arch can
actually do frame checking, the thread_info (if it exists on the
stack) would already be excluded. But it'd be a nice tightening of the
check.

-Kees

[v3,02/11] mm: Hardened usercopy

Commit Message

Comments

Patch