diff mbox

[RFC,5/8] ARM: Implement __arch_rare_write_map/unmap()

Message ID 1488228186-110679-6-git-send-email-keescook@chromium.org (mailing list archive)
State New, archived
Headers show

Commit Message

Kees Cook Feb. 27, 2017, 8:43 p.m. UTC
Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this
allows HAVE_ARCH_RARE_WRITE to work on ARM.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/arm/Kconfig               |  1 +
 arch/arm/include/asm/domain.h  |  3 ++-
 arch/arm/include/asm/pgtable.h | 27 +++++++++++++++++++++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

Comments

Russell King (Oracle) March 1, 2017, 1:04 a.m. UTC | #1
On Mon, Feb 27, 2017 at 12:43:03PM -0800, Kees Cook wrote:
> Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this
> allows HAVE_ARCH_RARE_WRITE to work on ARM.

This has the effect that any memory mapped with DOMAIN_KERNEL will
loose it's NX status, and may end up being read into the I-cache.

We used to do exactly this to support set_fs(KERNEL_DS) but it was
deemed to be way too problematical (for speculative prefetching)
to use it on ARMv6+.

As vmalloc space ends up with a random mixture of DOMAIN_KERNEL and
DOMAIN_IO mappings (due to the order of ioremap() vs vmalloc()), this
means DOMAIN_KERNEL can cover devices... which with switching
DOMAIN_KERNEL to manager mode result in the NX being removed for
device mappings, which (iirc) is unpredictable.
Kees Cook March 1, 2017, 5:41 a.m. UTC | #2
On Tue, Feb 28, 2017 at 5:04 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Mon, Feb 27, 2017 at 12:43:03PM -0800, Kees Cook wrote:
>> Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this
>> allows HAVE_ARCH_RARE_WRITE to work on ARM.
>
> This has the effect that any memory mapped with DOMAIN_KERNEL will
> loose it's NX status, and may end up being read into the I-cache.

Arbitrarily so, or only memory accessed/pre-fetched by the CPU when in
this state? i.e. since this is non-preempt, only touches the needed
memory, and has the original domain state restored within a few
instructions, does this avoid the problem? It seems like the chance
for a speculative prefetch from device memory under these conditions
should be approaching zero.

> We used to do exactly this to support set_fs(KERNEL_DS) but it was
> deemed to be way too problematical (for speculative prefetching)
> to use it on ARMv6+.
>
> As vmalloc space ends up with a random mixture of DOMAIN_KERNEL and
> DOMAIN_IO mappings (due to the order of ioremap() vs vmalloc()), this
> means DOMAIN_KERNEL can cover devices... which with switching
> DOMAIN_KERNEL to manager mode result in the NX being removed for
> device mappings, which (iirc) is unpredictable.

Just to make sure I understand: it was only speculative prefetch vs
icache, right? Would an icache flush restore the correct permissions?
I'm just pondering alternatives. Also, is there a maximum distance the
prefetching spans? i.e. could device memory be guaranteed to be
vmapped far enough away from kernel memory to avoid prefetches?

-Kees
Russell King (Oracle) March 1, 2017, 11:30 a.m. UTC | #3
On Tue, Feb 28, 2017 at 09:41:07PM -0800, Kees Cook wrote:
> On Tue, Feb 28, 2017 at 5:04 PM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Mon, Feb 27, 2017 at 12:43:03PM -0800, Kees Cook wrote:
> >> Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this
> >> allows HAVE_ARCH_RARE_WRITE to work on ARM.
> >
> > This has the effect that any memory mapped with DOMAIN_KERNEL will
> > loose it's NX status, and may end up being read into the I-cache.
> 
> Arbitrarily so, or only memory accessed/pre-fetched by the CPU when in
> this state? i.e. since this is non-preempt, only touches the needed
> memory, and has the original domain state restored within a few
> instructions, does this avoid the problem? It seems like the chance
> for a speculative prefetch from device memory under these conditions
> should be approaching zero.

"The software that defines a translation table must mark any region of
 memory that is read-sensitive as execute-never, to avoid the possibility
 of a speculative fetch accessing the memory region. For example, it must
 mark any memory region that corresponds to a read-sensitive peripheral
 as Execute-never."

Also see:

commit 247055aa21ffef1c49dd64710d5e94c2aee19b58
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Mon Sep 13 16:03:21 2010 +0100

    ARM: 6384/1: Remove the domain switching on ARMv6k/v7 CPUs

which removed the domain switching I referred to previously.

The way the ARM ARM looks at instruction speculative prefetch is that it
can happen to any location that is not explicitly marked as Execute-never.
(This is because the ARM ARM doesn't define an implementation.)  So, we
have to assume that any location that is not marked XN may be speculatively
prefetched by the processor.

Device memory can be read-sensitive - eg, reading an interrupt status
register can clear the ending interrupt bits.

A speculative prefetch is a read as far as a device is concerned, so
bypassing the XN permission by switching the domain to manager mode has
the effect that the processor can then _legally_ speculatively prefetch
from a device, and if it happens to hit a device that contains a read
sensitive location, the side effects of reading that location will
happen, even though the program did not perform an explicit read.

> Just to make sure I understand: it was only speculative prefetch vs
> icache, right? Would an icache flush restore the correct permissions?

It's not about permissions, it's about the side effects at the device
of a read created by the speculative prefetch.

> I'm just pondering alternatives. Also, is there a maximum distance the
> prefetching spans? i.e. could device memory be guaranteed to be
> vmapped far enough away from kernel memory to avoid prefetches?

The root cause of this problem is the way we lump both vmalloc() and
ioremap() mappings into the same memory space (vmalloc region) without
caring about the domain.

If all device memory was guaranteed to be placed under a different
domain, then this problem would not exist.  In order to achieve that,
there's several ways I can think of doing it:

1) Have separate virtual memory regions for ioremap() and vmalloc()
   We would need to choose an arbitary limit on the size of these
   memory pools, which may not suit everyone.

2) Have vmalloc() grow up as a heap, ioremap() grow down as a stack
   and a dynamic boundary (aligned to 1 or 2MB) between the two, no
   mixing allowed.  This avoids the problem with (1) but still results
   in the required separation.

3) Align vmalloc region allocations to 2MB, but this would be very
   wasteful.

4) Only permit same type (ioremap/vmalloc) of mapping within a 2MB block
   of vmalloc space.  In other words, a primary allocator of 2MB blocks
   and a sub-allocator of page-sized blocks (think of the way our
   page allocator vs slab works.)  Probably going to be subject to
   fragmentation problems.

5) Place all vmalloc() and ioremap() mappings under a separate domain,
   so that all these mappings would be unaffected by the change of
   domain settings (the resulting permissions would never change.)
   In other words, DOMAIN_IO becomes DOMAIN_VMALLOC and is used for all
   mappings in vmalloc space.

The problem with (2) and (5) is teaching pte_alloc_kernel() down to
pmd_populate_kernel() about the differences - currently, this only ever
sets up DOMAIN_KERNEL mappings because there's no way for it to know
what kind of mapping is required.
Mark Rutland March 1, 2017, 11:50 a.m. UTC | #4
Hi,

On Tue, Feb 28, 2017 at 09:41:07PM -0800, Kees Cook wrote:
> On Tue, Feb 28, 2017 at 5:04 PM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Mon, Feb 27, 2017 at 12:43:03PM -0800, Kees Cook wrote:
> >> Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this
> >> allows HAVE_ARCH_RARE_WRITE to work on ARM.
> >
> > This has the effect that any memory mapped with DOMAIN_KERNEL will
> > loose it's NX status, and may end up being read into the I-cache.
> 
> Arbitrarily so, or only memory accessed/pre-fetched by the CPU when in
> this state? 

While the MMU says the VA is executable, though "while" is difficult to
define, see below.

> i.e. since this is non-preempt, only touches the needed memory, and
> has the original domain state restored within a few instructions, does
> this avoid the problem? It seems like the chance for a speculative
> prefetch from device memory under these conditions should be
> approaching zero.

It's entirely possible for the I-cache to fetch within this window, even
if unlikely.

It is a bug to map devices without NX (on cores that have it), even
instantaneously.

You can't reason about this in terms of number of instructions, since
bits of the CPU can operate asynchronously anyhow. For example, the CPU
might stall immediately after the domain switch, fetching the next
instruction, and in the mean time the I-cache decides to send of
requests for some other arbitrary locations while waiting for a
response.

> > We used to do exactly this to support set_fs(KERNEL_DS) but it was
> > deemed to be way too problematical (for speculative prefetching)
> > to use it on ARMv6+.
> >
> > As vmalloc space ends up with a random mixture of DOMAIN_KERNEL and
> > DOMAIN_IO mappings (due to the order of ioremap() vs vmalloc()), this
> > means DOMAIN_KERNEL can cover devices... which with switching
> > DOMAIN_KERNEL to manager mode result in the NX being removed for
> > device mappings, which (iirc) is unpredictable.
> 
> Just to make sure I understand: it was only speculative prefetch vs
> icache, right? Would an icache flush restore the correct permissions?

The problem is that the fetch itself can be destructive. It can change
the state of a device (see below for an example), or trigger
(asynchronous) errors from the endpoint or interconnect.

No amount of cache maintenance can avoid this.

> I'm just pondering alternatives. Also, is there a maximum distance the
> prefetching spans? i.e. could device memory be guaranteed to be
> vmapped far enough away from kernel memory to avoid prefetches?

There is no practical limitation. The architecture permits a CPU's
I-cache to fetch from any mapping which does not have NX, at any point
in time that mapping is live, for any reason it sees fit.

For example, see commit b6ccb9803e90c16b ("ARM: 7954/1: mm: remove
remaining domain support from ARMv6").

In that case, while executing some kernel code (e.g. the sys_exec()
path), Cortex-A15's instruction fetch would occasionally fetch from the
GIC, ACKing interrupts in the process.

The only solution is to never map devices without NX.

Thanks,
Mark.
Kees Cook March 2, 2017, 12:08 a.m. UTC | #5
On Wed, Mar 1, 2017 at 3:30 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Tue, Feb 28, 2017 at 09:41:07PM -0800, Kees Cook wrote:
>> On Tue, Feb 28, 2017 at 5:04 PM, Russell King - ARM Linux
>> <linux@armlinux.org.uk> wrote:
>> > On Mon, Feb 27, 2017 at 12:43:03PM -0800, Kees Cook wrote:
>> >> Based on grsecurity's ARM pax_{open,close}_kernel() implementation, this
>> >> allows HAVE_ARCH_RARE_WRITE to work on ARM.
>> >
>> > This has the effect that any memory mapped with DOMAIN_KERNEL will
>> > loose it's NX status, and may end up being read into the I-cache.
>>
>> Arbitrarily so, or only memory accessed/pre-fetched by the CPU when in
>> this state? i.e. since this is non-preempt, only touches the needed
>> memory, and has the original domain state restored within a few
>> instructions, does this avoid the problem? It seems like the chance
>> for a speculative prefetch from device memory under these conditions
>> should be approaching zero.
>
> "The software that defines a translation table must mark any region of
>  memory that is read-sensitive as execute-never, to avoid the possibility
>  of a speculative fetch accessing the memory region. For example, it must
>  mark any memory region that corresponds to a read-sensitive peripheral
>  as Execute-never."
>
> Also see:
>
> commit 247055aa21ffef1c49dd64710d5e94c2aee19b58
> Author: Catalin Marinas <catalin.marinas@arm.com>
> Date:   Mon Sep 13 16:03:21 2010 +0100
>
>     ARM: 6384/1: Remove the domain switching on ARMv6k/v7 CPUs
>
> which removed the domain switching I referred to previously.
>
> The way the ARM ARM looks at instruction speculative prefetch is that it
> can happen to any location that is not explicitly marked as Execute-never.
> (This is because the ARM ARM doesn't define an implementation.)  So, we
> have to assume that any location that is not marked XN may be speculatively
> prefetched by the processor.
>
> Device memory can be read-sensitive - eg, reading an interrupt status
> register can clear the ending interrupt bits.

OOoh, yes, this is the part that wasn't getting in my head. I was
stuck thinking it was an XN bypass (due to the icache mention). Got
it.

> A speculative prefetch is a read as far as a device is concerned, so
> bypassing the XN permission by switching the domain to manager mode has
> the effect that the processor can then _legally_ speculatively prefetch
> from a device, and if it happens to hit a device that contains a read
> sensitive location, the side effects of reading that location will
> happen, even though the program did not perform an explicit read.
>
>> Just to make sure I understand: it was only speculative prefetch vs
>> icache, right? Would an icache flush restore the correct permissions?
>
> It's not about permissions, it's about the side effects at the device
> of a read created by the speculative prefetch.
>
>> I'm just pondering alternatives. Also, is there a maximum distance the
>> prefetching spans? i.e. could device memory be guaranteed to be
>> vmapped far enough away from kernel memory to avoid prefetches?
>
> The root cause of this problem is the way we lump both vmalloc() and
> ioremap() mappings into the same memory space (vmalloc region) without
> caring about the domain.

Okay, I patched ptdump (badly) to answer some questions I had on this,
but it looks like everything I was curious about is DOMAIN_KERNEL.

The memory area that write-rarely would want to touch would only be
the .rodata segment, which we could put under a different domain that
looked otherwise identical to DOMAIN_KERNEL. This would apply to
modules too, since those appear to be below the kernel, and not part
of the vmalloc area. Instead of dealing with vmalloc vs ioremap, don't
we just need to adjust the domain of the kernel (or just rodata) and
modules mappings then? We don't need to touch vmalloc at all.

Maybe I'm (still) missing something...

-Kees

> If all device memory was guaranteed to be placed under a different
> domain, then this problem would not exist.  In order to achieve that,
> there's several ways I can think of doing it:
>
> 1) Have separate virtual memory regions for ioremap() and vmalloc()
>    We would need to choose an arbitary limit on the size of these
>    memory pools, which may not suit everyone.
>
> 2) Have vmalloc() grow up as a heap, ioremap() grow down as a stack
>    and a dynamic boundary (aligned to 1 or 2MB) between the two, no
>    mixing allowed.  This avoids the problem with (1) but still results
>    in the required separation.
>
> 3) Align vmalloc region allocations to 2MB, but this would be very
>    wasteful.
>
> 4) Only permit same type (ioremap/vmalloc) of mapping within a 2MB block
>    of vmalloc space.  In other words, a primary allocator of 2MB blocks
>    and a sub-allocator of page-sized blocks (think of the way our
>    page allocator vs slab works.)  Probably going to be subject to
>    fragmentation problems.
>
> 5) Place all vmalloc() and ioremap() mappings under a separate domain,
>    so that all these mappings would be unaffected by the change of
>    domain settings (the resulting permissions would never change.)
>    In other words, DOMAIN_IO becomes DOMAIN_VMALLOC and is used for all
>    mappings in vmalloc space.
>
> The problem with (2) and (5) is teaching pte_alloc_kernel() down to
> pmd_populate_kernel() about the differences - currently, this only ever
> sets up DOMAIN_KERNEL mappings because there's no way for it to know
> what kind of mapping is required.
>
> --
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
diff mbox

Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 186c4c214e0a..c3e1369e7429 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -39,6 +39,7 @@  config ARM
 	select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+	select HAVE_ARCH_RARE_WRITE if MMU && !ARM_LPAE && !CPU_USE_DOMAINS
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
diff --git a/arch/arm/include/asm/domain.h b/arch/arm/include/asm/domain.h
index 99d9f630d6b6..6c72f533382c 100644
--- a/arch/arm/include/asm/domain.h
+++ b/arch/arm/include/asm/domain.h
@@ -53,6 +53,7 @@ 
 #define DOMAIN_MANAGER	3
 #else
 #define DOMAIN_MANAGER	1
+#define DOMAIN_RARE_WRITE 3
 #endif
 
 #define domain_mask(dom)	((3) << (2 * (dom)))
@@ -115,7 +116,7 @@  static inline void set_domain(unsigned val)
 }
 #endif
 
-#ifdef CONFIG_CPU_USE_DOMAINS
+#if defined(CONFIG_CPU_USE_DOMAINS) || defined(CONFIG_HAVE_ARCH_RARE_WRITE)
 #define modify_domain(dom,type)					\
 	do {							\
 		unsigned int domain = get_domain();		\
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a8d656d9aec7..c9492f3e9581 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -56,6 +56,33 @@  extern void __pgd_error(const char *file, int line, pgd_t);
 #define pmd_ERROR(pmd)		__pmd_error(__FILE__, __LINE__, pmd)
 #define pgd_ERROR(pgd)		__pgd_error(__FILE__, __LINE__, pgd)
 
+#ifdef CONFIG_HAVE_ARCH_RARE_WRITE
+#include <asm/domain.h>
+#include <linux/preempt.h>
+
+static inline int test_domain(int domain, int domaintype)
+{
+	return (get_domain() & domain_val(domain, 3)) ==
+		domain_val(domain, domaintype);
+}
+
+static inline unsigned long __arch_rare_write_map(void)
+{
+	preempt_disable();
+	BUG_ON(test_domain(DOMAIN_KERNEL, DOMAIN_RARE_WRITE));
+	modify_domain(DOMAIN_KERNEL, DOMAIN_RARE_WRITE);
+	return 0;
+}
+
+static inline unsigned long __arch_rare_write_unmap(void)
+{
+	BUG_ON(test_domain(DOMAIN_KERNEL, DOMAIN_MANAGER));
+	modify_domain(DOMAIN_KERNEL, DOMAIN_MANAGER);
+	preempt_enable_no_resched();
+	return 0;
+}
+#endif
+
 /*
  * This is the lowest virtual address we can permit any user space
  * mapping to be mapped at.  This is particularly important for