[4.19,079/276] x86/modules: Avoid breaking W^X while loading modules

Message ID 20190530030531.331779845@linuxfoundation.org (mailing list archive)
State New, archived

Commit Message

Greg KH May 30, 2019, 3:03 a.m. UTC
[ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ]

When modules and BPF filters are loaded, there is a time window in
which some memory is both writable and executable. An attacker that has
already found another vulnerability (e.g., a dangling pointer) might be
able to exploit this behavior to overwrite kernel code. Prevent having
writable, executable PTEs at this stage.

In addition, avoiding W+X mappings also slightly simplifies the
patching of module code on initialization (e.g., by alternatives and
static keys), as will be done in the next patch. This was actually the
main motivation for this patch.

To avoid W+X mappings, set the memory initially as RW (NX), and only
after it is set RO, set it as X as well. Setting it executable is done
as a separate step to avoid a situation in which one core still has the
old (writable) PTE cached while another already sees the updated
(executable) PTE, which would break the W^X protection.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kernel/alternative.c | 28 +++++++++++++++++++++-------
 arch/x86/kernel/module.c      |  2 +-
 include/linux/filter.h        |  1 +
 kernel/module.c               |  5 +++++
 4 files changed, 28 insertions(+), 8 deletions(-)
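
The ordering described in the commit message can be summarized in a
short sketch (not taken from the patch; a simplified flow in which
npages stands for the page count of the text range):

    /* Module text starts out RW + NX (PAGE_KERNEL, see the module.c hunk). */
    void *text = module_alloc(size);

    memcpy(text, image, size);      /* write while still non-executable */
    /* ... relocations, alternatives, etc. (text_poke_early()) ...      */

    set_memory_ro((unsigned long)text, npages);  /* step 1: drop write  */
    set_memory_x((unsigned long)text, npages);   /* step 2: grant exec  */

Because execute permission is granted only after write permission has
been removed, no core can observe a W+X mapping of the module text at
any point.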

Comments

Pavel Machek May 31, 2019, 10:37 a.m. UTC | #1
Hi!

> [ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ]
> 
> When modules and BPF filters are loaded, there is a time window in
> which some memory is both writable and executable. An attacker that has
> already found another vulnerability (e.g., a dangling pointer) might be
> able to exploit this behavior to overwrite kernel code. Prevent having
> writable, executable PTEs at this stage.
> 
> In addition, avoiding W+X mappings also slightly simplifies the
> patching of module code on initialization (e.g., by alternatives and
> static keys), as will be done in the next patch. This was actually the
> main motivation for this patch.
> 
> To avoid W+X mappings, set the memory initially as RW (NX), and only
> after it is set RO, set it as X as well. Setting it executable is done
> as a separate step to avoid a situation in which one core still has the
> old (writable) PTE cached while another already sees the updated
> (executable) PTE, which would break the W^X protection.

First, is this stable material? Yes, it changes something.

But if you assume an attacker can write into kernel memory during module
load, what prevents them from changing the module as they see fit while
it is not yet executable, simply waiting for the system to execute it?

I don't see a security benefit here.

> +++ b/arch/x86/kernel/alternative.c
> @@ -662,15 +662,29 @@ void __init alternative_instructions(void)
>   * handlers seeing an inconsistent instruction while you patch.
>   */
>  void *__init_or_module text_poke_early(void *addr, const void *opcode,
> -					      size_t len)
> +				       size_t len)
>  {
>  	unsigned long flags;
> -	local_irq_save(flags);
> -	memcpy(addr, opcode, len);
> -	local_irq_restore(flags);
> -	sync_core();
> -	/* Could also do a CLFLUSH here to speed up CPU recovery; but
> -	   that causes hangs on some VIA CPUs. */
> +
> +	if (boot_cpu_has(X86_FEATURE_NX) &&
> +	    is_module_text_address((unsigned long)addr)) {
> +		/*
> +		 * Modules text is marked initially as non-executable, so the
> +		 * code cannot be running and speculative code-fetches are
> +		 * prevented. Just change the code.
> +		 */
> +		memcpy(addr, opcode, len);
> +	} else {
> +		local_irq_save(flags);
> +		memcpy(addr, opcode, len);
> +		local_irq_restore(flags);
> +		sync_core();
> +
> +		/*
> +		 * Could also do a CLFLUSH here to speed up CPU recovery; but
> +		 * that causes hangs on some VIA CPUs.
> +		 */

I don't get it. If the code cannot be running here, it cannot be running
in the !NX case either, and we are free to just change it. Speculative
execution should not be a problem either, as CPUs are supposed to mask
it, and there are no known bugs in that area. (Plus, I'd not be
surprised if speculative execution ignored NX... just saying :-) )


									Pavel
Nadav Amit May 31, 2019, 9:47 p.m. UTC | #2
> On May 31, 2019, at 3:37 AM, Pavel Machek <pavel@denx.de> wrote:
> 
> Hi!
> 
>> [ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ]
>> 
>> When modules and BPF filters are loaded, there is a time window in
>> which some memory is both writable and executable. An attacker that has
>> already found another vulnerability (e.g., a dangling pointer) might be
>> able to exploit this behavior to overwrite kernel code. Prevent having
>> writable, executable PTEs at this stage.
>> 
>> In addition, avoiding W+X mappings also slightly simplifies the
>> patching of module code on initialization (e.g., by alternatives and
>> static keys), as will be done in the next patch. This was actually the
>> main motivation for this patch.
>> 
>> To avoid W+X mappings, set the memory initially as RW (NX), and only
>> after it is set RO, set it as X as well. Setting it executable is done
>> as a separate step to avoid a situation in which one core still has the
>> old (writable) PTE cached while another already sees the updated
>> (executable) PTE, which would break the W^X protection.
> 
> First, is this stable material? Yes, it changes something.
> 
> But if you assume an attacker can write into kernel memory during module
> load, what prevents them from changing the module as they see fit while
> it is not yet executable, simply waiting for the system to execute it?
> 
> I don't see a security benefit here.

I agree that at the moment the benefit is limited. I think the benefit
would come later, if the module signature check is performed after the
module has been write-protected, but before it is actually executed.
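
Roughly, the idea would be something like the following (a hypothetical
flow, not what the kernel does today; verify_module_signature() and
set_module_text_x() are made-up names for illustration):

	module_enable_ro(mod, false);		/* text is RO, still NX */

	err = verify_module_signature(mod);	/* hypothetical: check after RO */
	if (err)
		goto free_module;

	set_module_text_x(mod);			/* hypothetical: only now X */

With that ordering, nothing could modify the image between the signature
check and the point where it becomes executable.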

>> +++ b/arch/x86/kernel/alternative.c
>> @@ -662,15 +662,29 @@ void __init alternative_instructions(void)
>>  * handlers seeing an inconsistent instruction while you patch.
>>  */
>> void *__init_or_module text_poke_early(void *addr, const void *opcode,
>> -					      size_t len)
>> +				       size_t len)
>> {
>> 	unsigned long flags;
>> -	local_irq_save(flags);
>> -	memcpy(addr, opcode, len);
>> -	local_irq_restore(flags);
>> -	sync_core();
>> -	/* Could also do a CLFLUSH here to speed up CPU recovery; but
>> -	   that causes hangs on some VIA CPUs. */
>> +
>> +	if (boot_cpu_has(X86_FEATURE_NX) &&
>> +	    is_module_text_address((unsigned long)addr)) {
>> +		/*
>> +		 * Modules text is marked initially as non-executable, so the
>> +		 * code cannot be running and speculative code-fetches are
>> +		 * prevented. Just change the code.
>> +		 */
>> +		memcpy(addr, opcode, len);
>> +	} else {
>> +		local_irq_save(flags);
>> +		memcpy(addr, opcode, len);
>> +		local_irq_restore(flags);
>> +		sync_core();
>> +
>> +		/*
>> +		 * Could also do a CLFLUSH here to speed up CPU recovery; but
>> +		 * that causes hangs on some VIA CPUs.
>> +		 */
> 
> I don't get it. If the code cannot be running here, it cannot be running
> in the !NX case either, and we are free to just change it. Speculative
> execution should not be a problem either, as CPUs are supposed to mask
> it, and there are no known bugs in that area. (Plus, I'd not be
> surprised if speculative execution ignored NX... just saying :-) )

Yes, the module code should not run, but speculative execution might cause
it to be cached in the instruction cache (as unlikely as that might be, but
we need to consider malicious users who play with branch predictors).

I am unfamiliar with any bug that might cause the CPU to speculatively
ignore the NX bit. Without underestimating Intel’s ability to create
terrible bugs, I would assume, for now, that it is safe.

Patch

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index b9d5e7c9ef43e..918a23704c0c5 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -662,15 +662,29 @@  void __init alternative_instructions(void)
  * handlers seeing an inconsistent instruction while you patch.
  */
 void *__init_or_module text_poke_early(void *addr, const void *opcode,
-					      size_t len)
+				       size_t len)
 {
 	unsigned long flags;
-	local_irq_save(flags);
-	memcpy(addr, opcode, len);
-	local_irq_restore(flags);
-	sync_core();
-	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-	   that causes hangs on some VIA CPUs. */
+
+	if (boot_cpu_has(X86_FEATURE_NX) &&
+	    is_module_text_address((unsigned long)addr)) {
+		/*
+		 * Modules text is marked initially as non-executable, so the
+		 * code cannot be running and speculative code-fetches are
+		 * prevented. Just change the code.
+		 */
+		memcpy(addr, opcode, len);
+	} else {
+		local_irq_save(flags);
+		memcpy(addr, opcode, len);
+		local_irq_restore(flags);
+		sync_core();
+
+		/*
+		 * Could also do a CLFLUSH here to speed up CPU recovery; but
+		 * that causes hangs on some VIA CPUs.
+		 */
+	}
 	return addr;
 }
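
The fast path above relies on the module.c change below: module_alloc()
now returns RW + NX memory, so a poke into a not-yet-live module reduces
to a plain memcpy(). A sketch of such a caller (simplified; patch_insn()
is a made-up wrapper, the real callers are apply_alternatives() and
friends):

	static void __init_or_module patch_insn(void *instr,
						const void *repl, size_t len)
	{
		/*
		 * instr points into module text that is still RW + NX at
		 * module-init time, so text_poke_early() takes the bare
		 * memcpy() branch and skips the IRQ-disable/sync_core()
		 * dance.
		 */
		text_poke_early(instr, repl, len);
	}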
 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f58336af095c9..6645f123419c6 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -87,7 +87,7 @@  void *module_alloc(unsigned long size)
 	p = __vmalloc_node_range(size, MODULE_ALIGN,
 				    MODULES_VADDR + get_module_load_offset(),
 				    MODULES_END, GFP_KERNEL,
-				    PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
+				    PAGE_KERNEL, 0, NUMA_NO_NODE,
 				    __builtin_return_address(0));
 	if (p && (kasan_module_alloc(p, size) < 0)) {
 		vfree(p);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6ecef32fd8bcd..d52a7484aeb2d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -684,6 +684,7 @@  static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
 static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
 {
 	set_memory_ro((unsigned long)hdr, hdr->pages);
+	set_memory_x((unsigned long)hdr, hdr->pages);
 }
 
 static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr)
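
With this change, an arch JIT freezes its image in two explicit steps,
RO first, then X. A sketch of the call site (simplified from a typical
bpf_int_jit_compile()):

	/* The image is fully written; freeze it before exposing it. */
	bpf_jit_binary_lock_ro(header);	/* set_memory_ro(), then set_memory_x() */
	prog->bpf_func = (void *)image;
	prog->jited = 1;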
diff --git a/kernel/module.c b/kernel/module.c
index 38bf28b5cc202..f797c6ace7121 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1949,8 +1949,13 @@  void module_enable_ro(const struct module *mod, bool after_init)
 		return;
 
 	frob_text(&mod->core_layout, set_memory_ro);
+	frob_text(&mod->core_layout, set_memory_x);
+
 	frob_rodata(&mod->core_layout, set_memory_ro);
+
 	frob_text(&mod->init_layout, set_memory_ro);
+	frob_text(&mod->init_layout, set_memory_x);
+
 	frob_rodata(&mod->init_layout, set_memory_ro);
 
 	if (after_init)
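
For reference, frob_text() just applies the given set_memory_*()
callback to the layout's text range, so the RO and X transitions above
really are two separate PTE updates (simplified from kernel/module.c,
with sanity checks dropped):

	static void frob_text(const struct module_layout *layout,
			      int (*set_memory)(unsigned long start,
						int num_pages))
	{
		set_memory((unsigned long)layout->base,
			   layout->text_size >> PAGE_SHIFT);
	}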