[v6,14/27] x86/percpu: Adapt percpu for PIE support
diff mbox series

Message ID 20190131192533.34130-15-thgarnie@chromium.org
State New
Headers show
Series
  • x86: PIE support and option to extend KASLR randomization
Related show

Commit Message

Thomas Garnier Jan. 31, 2019, 7:24 p.m. UTC
Perpcu uses a clever design where the .percu ELF section has a virtual
address of zero and the custom linux relocation code avoid relocating
specific symbols. It makes the code simple and easily adaptable with or
without SMP support.

This design is incompatible with PIE. While creating a PIE binary, the
copmiler tries to make everything relative. The compiler will attempt to
generate instructions with the distance between zero and any 64-bit
virtual address. It will fail as the relocation range cannot fit within
the possible instructions accessing a segment register.

This patch solves tihs problem by removing the zero mapping. The .percpu
symbols are now close to the base of the kernel and the compiler
generates appropriate relocations. To accomodate this change, the GS base
is adapted to be the difference between zero and the .percpu section
address. These changes are done only when PIE is enabled. The original
implementation is kept as-is by default.

The assembly and PER_CPU macros are changed to use relative references
when PIE is enabled.

The KALLSYMS_ABSOLUTE_PERCPU configuration is disabled with PIE given
percpu symbols are not absolute in this case.

Position Independent Executable (PIE) support will allow to extend the
KASLR randomization range below 0xffffffff80000000.

Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
---
 arch/x86/entry/calling.h         |  2 +-
 arch/x86/entry/entry_64.S        |  4 ++--
 arch/x86/include/asm/percpu.h    | 25 +++++++++++++++++++------
 arch/x86/include/asm/processor.h |  4 +++-
 arch/x86/kernel/head_64.S        |  4 ++++
 arch/x86/kernel/setup_percpu.c   |  5 ++++-
 arch/x86/kernel/vmlinux.lds.S    | 13 +++++++++++--
 arch/x86/lib/cmpxchg16b_emu.S    |  8 ++++----
 arch/x86/xen/xen-asm.S           | 12 ++++++------
 init/Kconfig                     |  2 +-
 10 files changed, 55 insertions(+), 24 deletions(-)

Comments

Christopher Lameter Jan. 31, 2019, 8:57 p.m. UTC | #1
On Thu, 31 Jan 2019, Thomas Garnier wrote:

> Perpcu uses a clever design where the .percu ELF section has a virtual
> address of zero and the custom linux relocation code avoid relocating
> specific symbols. It makes the code simple and easily adaptable with or
> without SMP support.

We usually talk about this as offsets rather than addressess. The intend
here is to give every processor its own address that is unique for this
processor. Operations are always relative to a segment register and the
whole area can be relocated at will by simply changing the segment
register.

> This design is incompatible with PIE. While creating a PIE binary, the
> copmiler tries to make everything relative. The compiler will attempt to

This is very compatible with PIE because it is already relative.

> generate instructions with the distance between zero and any 64-bit
> virtual address. It will fail as the relocation range cannot fit within
> the possible instructions accessing a segment register.

Leave the offsets alone and just change the segment register if you need
to relocate the area of a specific processor?

> The assembly and PER_CPU macros are changed to use relative references
> when PIE is enabled.

They already use relative reference. What is the point here?

> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -5,9 +5,11 @@
>  #ifdef CONFIG_X86_64
>  #define __percpu_seg		gs
>  #define __percpu_mov_op		movq
> +#define __percpu_rel		(%rip)

The percpu section cannot be IP relative since we need to have separate
address spaces per cpu.
Thomas Garnier Jan. 31, 2019, 10:49 p.m. UTC | #2
On Thu, Jan 31, 2019 at 12:57 PM Christopher Lameter <cl@linux.com> wrote:
>
> On Thu, 31 Jan 2019, Thomas Garnier wrote:
>
> > Perpcu uses a clever design where the .percu ELF section has a virtual
> > address of zero and the custom linux relocation code avoid relocating
> > specific symbols. It makes the code simple and easily adaptable with or
> > without SMP support.
>
> We usually talk about this as offsets rather than addressess. The intend
> here is to give every processor its own address that is unique for this
> processor. Operations are always relative to a segment register and the
> whole area can be relocated at will by simply changing the segment
> register.
>
> > This design is incompatible with PIE. While creating a PIE binary, the
> > copmiler tries to make everything relative. The compiler will attempt to
>
> This is very compatible with PIE because it is already relative.

The per-cpu symbols are in a section that is zero based to create
offsets. The compiler doesn't see them as offsets but as relative
symbol and try to relocate them. Given the distance between zero and
the mapped kernel is much larger than the instruction offset range, it
fails to do it.

>
> > generate instructions with the distance between zero and any 64-bit
> > virtual address. It will fail as the relocation range cannot fit within
> > the possible instructions accessing a segment register.
>
> Leave the offsets alone and just change the segment register if you need
> to relocate the area of a specific processor?
>
> > The assembly and PER_CPU macros are changed to use relative references
> > when PIE is enabled.
>
> They already use relative reference. What is the point here?
>
> > --- a/arch/x86/include/asm/percpu.h
> > +++ b/arch/x86/include/asm/percpu.h
> > @@ -5,9 +5,11 @@
> >  #ifdef CONFIG_X86_64
> >  #define __percpu_seg         gs
> >  #define __percpu_mov_op              movq
> > +#define __percpu_rel         (%rip)
>
> The percpu section cannot be IP relative since we need to have separate
> address spaces per cpu.
>
Christopher Lameter Feb. 1, 2019, 2:31 a.m. UTC | #3
On Thu, 31 Jan 2019, Thomas Garnier wrote:

> The per-cpu symbols are in a section that is zero based to create
> offsets. The compiler doesn't see them as offsets but as relative
> symbol and try to relocate them. Given the distance between zero and
> the mapped kernel is much larger than the instruction offset range, it
> fails to do it.

We switch that off in the linker. If that does not work with your
modifications then you need to figure out how to update the link
configuration.
Thomas Garnier Feb. 1, 2019, 5:13 p.m. UTC | #4
On Thu, Jan 31, 2019 at 6:31 PM Christopher Lameter <cl@linux.com> wrote:
>
> On Thu, 31 Jan 2019, Thomas Garnier wrote:
>
> > The per-cpu symbols are in a section that is zero based to create
> > offsets. The compiler doesn't see them as offsets but as relative
> > symbol and try to relocate them. Given the distance between zero and
> > the mapped kernel is much larger than the instruction offset range, it
> > fails to do it.
>
> We switch that off in the linker. If that does not work with your
> modifications then you need to figure out how to update the link
> configuration.
>

It didn't work originally but I will revisit to see if I missed something.
Thomas Garnier April 8, 2019, 3:58 p.m. UTC | #5
On Fri, Feb 1, 2019 at 9:13 AM Thomas Garnier <thgarnie@chromium.org> wrote:
>
> On Thu, Jan 31, 2019 at 6:31 PM Christopher Lameter <cl@linux.com> wrote:
> >
> > On Thu, 31 Jan 2019, Thomas Garnier wrote:
> >
> > > The per-cpu symbols are in a section that is zero based to create
> > > offsets. The compiler doesn't see them as offsets but as relative
> > > symbol and try to relocate them. Given the distance between zero and
> > > the mapped kernel is much larger than the instruction offset range, it
> > > fails to do it.
> >
> > We switch that off in the linker. If that does not work with your
> > modifications then you need to figure out how to update the link
> > configuration.
> >
>
> It didn't work originally but I will revisit to see if I missed something.

I revisited and couldn't find a way to prevent relocations to the
percpu section. Without PIE, you can reference absolute address which
was convenient for percpu.

Christopher: Did you have something specific in mind?

I checked the following:
 - Changing the FLAGS() on the PHDRS.
 - using -z noreloc-overflow which actually doesn't seem to apply to
PC32 relocations.
 - Look at all linker options and script format for anything around that.
Christopher Lameter April 8, 2019, 5:56 p.m. UTC | #6
On Mon, 8 Apr 2019, Thomas Garnier wrote:

> > It didn't work originally but I will revisit to see if I missed something.
>
> I revisited and couldn't find a way to prevent relocations to the
> percpu section. Without PIE, you can reference absolute address which
> was convenient for percpu.

Can you switch PIE off for the percpu section? If not maybe the linker
needs to have an additional option?

Cannot imagine that this is not possible. You neeed to be able to
reference registers that are in fixed memory locations.


> Christopher: Did you have something specific in mind?

I thought that we just leave it as is.
Thomas Garnier April 8, 2019, 6:08 p.m. UTC | #7
On Mon, Apr 8, 2019 at 10:56 AM Christopher Lameter <cl@linux.com> wrote:
>
> On Mon, 8 Apr 2019, Thomas Garnier wrote:
>
> > > It didn't work originally but I will revisit to see if I missed something.
> >
> > I revisited and couldn't find a way to prevent relocations to the
> > percpu section. Without PIE, you can reference absolute address which
> > was convenient for percpu.
>
> Can you switch PIE off for the percpu section? If not maybe the linker
> needs to have an additional option?

I don't think so or I didn't find any option to do that. Changing the
linker might be a bit too much if we have a software solution which
doesn't impact performance.

>
> Cannot imagine that this is not possible. You neeed to be able to
> reference registers that are in fixed memory locations.
>
>
> > Christopher: Did you have something specific in mind?
>
> I thought that we just leave it as is.

I would like to as well. I will try couple things at the assembly
level instead of the linker and come back to this thread.

Patch
diff mbox series

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index efb0d1b1f15f..d5a6d3a0c24b 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -218,7 +218,7 @@  For 32-bit we have the following conventions - kernel is built with
 .endm
 
 #define THIS_CPU_user_pcid_flush_mask   \
-	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
+	PER_CPU_VAR(cpu_tlbstate + TLB_STATE_user_pcid_flush_mask)
 
 .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 16a93eb4c11f..fc15fe058d3c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -298,7 +298,7 @@  ENTRY(__switch_to_asm)
 
 #ifdef CONFIG_STACKPROTECTOR
 	movq	TASK_stack_canary(%rsi), %rbx
-	movq	%rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
+	movq	%rbx, PER_CPU_VAR(irq_stack_union + stack_canary_offset)
 #endif
 
 #ifdef CONFIG_RETPOLINE
@@ -841,7 +841,7 @@  apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 /*
  * Exception entry points.
  */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + ((x) - 1) * 8)
+#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw + (TSS_ist + ((x) - 1) * 8))
 
 /**
  * idtentry - Generate an IDT entry stub
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 1a19d11cfbbd..608c15751f29 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -5,9 +5,11 @@ 
 #ifdef CONFIG_X86_64
 #define __percpu_seg		gs
 #define __percpu_mov_op		movq
+#define __percpu_rel		(%rip)
 #else
 #define __percpu_seg		fs
 #define __percpu_mov_op		movl
+#define __percpu_rel
 #endif
 
 #ifdef __ASSEMBLY__
@@ -28,10 +30,14 @@ 
 #define PER_CPU(var, reg)						\
 	__percpu_mov_op %__percpu_seg:this_cpu_off, reg;		\
 	lea var(reg), reg
-#define PER_CPU_VAR(var)	%__percpu_seg:var
+/* Compatible with Position Independent Code */
+#define PER_CPU_VAR(var)		%__percpu_seg:(var)##__percpu_rel
+/* Rare absolute reference */
+#define PER_CPU_VAR_ABS(var)		%__percpu_seg:var
 #else /* ! SMP */
 #define PER_CPU(var, reg)	__percpu_mov_op $var, reg
-#define PER_CPU_VAR(var)	var
+#define PER_CPU_VAR(var)	(var)##__percpu_rel
+#define PER_CPU_VAR_ABS(var)	var
 #endif	/* SMP */
 
 #ifdef CONFIG_X86_64_SMP
@@ -209,27 +215,34 @@  do {									\
 	pfo_ret__;					\
 })
 
+/* Position Independent code uses relative addresses only */
+#ifdef CONFIG_X86_PIE
+#define __percpu_stable_arg __percpu_arg(a1)
+#else
+#define __percpu_stable_arg __percpu_arg(P1)
+#endif
+
 #define percpu_stable_op(op, var)			\
 ({							\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b "__percpu_arg(P1)",%0"	\
+		asm(op "b "__percpu_stable_arg ",%0"	\
 		    : "=q" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 2:						\
-		asm(op "w "__percpu_arg(P1)",%0"	\
+		asm(op "w "__percpu_stable_arg ",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 4:						\
-		asm(op "l "__percpu_arg(P1)",%0"	\
+		asm(op "l "__percpu_stable_arg ",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 8:						\
-		asm(op "q "__percpu_arg(P1)",%0"	\
+		asm(op "q "__percpu_stable_arg ",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index ce9851bf6778..18f1e8269ad7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -24,6 +24,7 @@  struct vm86;
 #include <asm/special_insns.h>
 #include <asm/fpu/types.h>
 #include <asm/unwind_hints.h>
+#include <asm/sections.h>
 
 #include <linux/personality.h>
 #include <linux/cache.h>
@@ -402,7 +403,8 @@  DECLARE_INIT_PER_CPU(irq_stack_union);
 
 static inline unsigned long cpu_kernelmode_gs_base(int cpu)
 {
-	return (unsigned long)per_cpu(irq_stack_union.gs_base, cpu);
+	return (unsigned long)per_cpu(irq_stack_union.gs_base, cpu) -
+		(unsigned long)__per_cpu_start;
 }
 
 DECLARE_PER_CPU(char *, irq_stack_ptr);
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b9b6c6aa0313..0f1739d7bff7 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -269,7 +269,11 @@  ENDPROC(start_cpu0)
 	GLOBAL(initial_code)
 	.quad	x86_64_start_kernel
 	GLOBAL(initial_gs)
+#ifdef CONFIG_X86_PIE
+	.quad	0
+#else
 	.quad	INIT_PER_CPU_VAR(irq_stack_union)
+#endif
 	GLOBAL(initial_stack)
 	/*
 	 * The SIZEOF_PTREGS gap is a convention which helps the in-kernel
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index e8796fcd7e5a..cc66f3434da9 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -26,7 +26,7 @@ 
 DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
 EXPORT_PER_CPU_SYMBOL(cpu_number);
 
-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) && !defined(CONFIG_X86_PIE)
 #define BOOT_PERCPU_OFFSET ((unsigned long)__per_cpu_load)
 #else
 #define BOOT_PERCPU_OFFSET 0
@@ -40,6 +40,9 @@  unsigned long __per_cpu_offset[NR_CPUS] __ro_after_init = {
 };
 EXPORT_SYMBOL(__per_cpu_offset);
 
+/* Used to calculate gs_base for each CPU */
+EXPORT_SYMBOL(__per_cpu_start);
+
 /*
  * On x86_64 symbols referenced from code should be reachable using
  * 32bit relocations.  Reserve space for static percpu variables in
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index bad8c51fee6e..7b461fa82107 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -222,9 +222,14 @@  SECTIONS
 	/*
 	 * percpu offsets are zero-based on SMP.  PERCPU_VADDR() changes the
 	 * output PHDR, so the next output section - .init.text - should
-	 * start another segment - init.
+	 * start another segment - init. For Position Independent Code, the
+	 * per-cpu section cannot be zero-based because everything is relative.
 	 */
+#ifdef CONFIG_X86_PIE
+	PERCPU_SECTION(INTERNODE_CACHE_BYTES)
+#else
 	PERCPU_VADDR(INTERNODE_CACHE_BYTES, 0, :percpu)
+#endif
 	ASSERT(SIZEOF(.data..percpu) < CONFIG_PHYSICAL_START,
 	       "per-CPU data too large - increase CONFIG_PHYSICAL_START")
 #endif
@@ -401,7 +406,11 @@  SECTIONS
  * Per-cpu symbols which need to be offset from __per_cpu_load
  * for the boot processor.
  */
+#ifdef CONFIG_X86_PIE
+#define INIT_PER_CPU(x) init_per_cpu__##x = ABSOLUTE(x)
+#else
 #define INIT_PER_CPU(x) init_per_cpu__##x = ABSOLUTE(x) + __per_cpu_load
+#endif
 INIT_PER_CPU(gdt_page);
 INIT_PER_CPU(irq_stack_union);
 
@@ -411,7 +420,7 @@  INIT_PER_CPU(irq_stack_union);
 . = ASSERT((_end - _text <= KERNEL_IMAGE_SIZE),
 	   "kernel image bigger than KERNEL_IMAGE_SIZE");
 
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && !defined(CONFIG_X86_PIE)
 . = ASSERT((irq_stack_union == 0),
            "irq_stack_union is not at start of per-cpu area");
 #endif
diff --git a/arch/x86/lib/cmpxchg16b_emu.S b/arch/x86/lib/cmpxchg16b_emu.S
index 9b330242e740..254950604ae4 100644
--- a/arch/x86/lib/cmpxchg16b_emu.S
+++ b/arch/x86/lib/cmpxchg16b_emu.S
@@ -33,13 +33,13 @@  ENTRY(this_cpu_cmpxchg16b_emu)
 	pushfq
 	cli
 
-	cmpq PER_CPU_VAR((%rsi)), %rax
+	cmpq PER_CPU_VAR_ABS((%rsi)), %rax
 	jne .Lnot_same
-	cmpq PER_CPU_VAR(8(%rsi)), %rdx
+	cmpq PER_CPU_VAR_ABS(8(%rsi)), %rdx
 	jne .Lnot_same
 
-	movq %rbx, PER_CPU_VAR((%rsi))
-	movq %rcx, PER_CPU_VAR(8(%rsi))
+	movq %rbx, PER_CPU_VAR_ABS((%rsi))
+	movq %rcx, PER_CPU_VAR_ABS(8(%rsi))
 
 	popfq
 	mov $1, %al
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 8019edd0125c..a5d73d3218be 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -21,7 +21,7 @@ 
 ENTRY(xen_irq_enable_direct)
 	FRAME_BEGIN
 	/* Unmask events */
-	movb $0, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	movb $0, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 
 	/*
 	 * Preempt here doesn't matter because that will deal with any
@@ -30,7 +30,7 @@  ENTRY(xen_irq_enable_direct)
 	 */
 
 	/* Test for pending */
-	testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
+	testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_pending)
 	jz 1f
 
 	call check_events
@@ -45,7 +45,7 @@  ENTRY(xen_irq_enable_direct)
  * non-zero.
  */
 ENTRY(xen_irq_disable_direct)
-	movb $1, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	movb $1, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	ret
 ENDPROC(xen_irq_disable_direct)
 
@@ -59,7 +59,7 @@  ENDPROC(xen_irq_disable_direct)
  * x86 use opposite senses (mask vs enable).
  */
 ENTRY(xen_save_fl_direct)
-	testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	setz %ah
 	addb %ah, %ah
 	ret
@@ -80,7 +80,7 @@  ENTRY(xen_restore_fl_direct)
 #else
 	testb $X86_EFLAGS_IF>>8, %ah
 #endif
-	setz PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	setz PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	/*
 	 * Preempt here doesn't matter because that will deal with any
 	 * pending interrupts.  The pending check may end up being run
@@ -88,7 +88,7 @@  ENTRY(xen_restore_fl_direct)
 	 */
 
 	/* check for unmasked and pending */
-	cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
+	cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_pending)
 	jnz 1f
 	call check_events
 1:
diff --git a/init/Kconfig b/init/Kconfig
index 1486b913daeb..bb383615823a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1453,7 +1453,7 @@  config KALLSYMS_ALL
 config KALLSYMS_ABSOLUTE_PERCPU
 	bool
 	depends on KALLSYMS
-	default X86_64 && SMP
+	default X86_64 && SMP && !X86_PIE
 
 config KALLSYMS_BASE_RELATIVE
 	bool