[RFC,16/22] x86/percpu: Adapt percpu for PIE support

Message ID 20170718223333.110371-17-thgarnie@google.com (mailing list archive)
State New, archived

Commit Message

Thomas Garnier July 18, 2017, 10:33 p.m. UTC
Percpu uses a clever design where the .percpu ELF section has a virtual
address of zero and the relocation code avoids relocating specific
symbols. It makes the code simple and easily adaptable with or without
SMP support.

This design is incompatible with PIE because generated code always tries to
access the zero virtual address relative to the default mapping address.
It becomes impossible when KASLR is configured to go below -2G. This
patch solves this problem by removing the zero mapping and adapting the GS
base to be relative to the expected address. These changes are done only
when PIE is enabled. The original implementation is kept as-is
by default.

The assembly and PER_CPU macros are changed to use relative references
when PIE is enabled.

The KALLSYMS_ABSOLUTE_PERCPU configuration is disabled with PIE, given
that percpu symbols are not absolute in this case.

Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
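
In other words, the GS base absorbs wherever the section is linked. A
rough sketch of the math in C (illustrative only; gs_base_for() is a
hypothetical stand-in for looking up this CPU's per-cpu area, the other
names are the kernel's):

static void set_gs_base_sketch(int cpu)
{
#ifdef CONFIG_X86_PIE
        /*
         * .data..percpu keeps its natural link-time addresses, so the
         * GS base is pulled back by __per_cpu_start: a relative access
         * like %gs:sym(%rip) then lands in this CPU's per-cpu area.
         */
        wrmsrl(MSR_GS_BASE,
               gs_base_for(cpu) - (unsigned long)__per_cpu_start);
#else
        /*
         * Zero-based .data..percpu: symbols are small offsets, so the
         * GS base is simply this CPU's per-cpu area and %gs:sym works
         * unchanged.
         */
        wrmsrl(MSR_GS_BASE, gs_base_for(cpu));
#endif
}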

Signed-off-by: Thomas Garnier <thgarnie@google.com>
---
 arch/x86/entry/entry_64.S      |  4 ++--
 arch/x86/include/asm/percpu.h  | 25 +++++++++++++++++++------
 arch/x86/kernel/cpu/common.c   |  4 +++-
 arch/x86/kernel/head_64.S      |  4 ++++
 arch/x86/kernel/setup_percpu.c |  2 +-
 arch/x86/kernel/vmlinux.lds.S  | 13 +++++++++++--
 arch/x86/lib/cmpxchg16b_emu.S  |  8 ++++----
 arch/x86/xen/xen-asm.S         | 12 ++++++------
 init/Kconfig                   |  2 +-
 9 files changed, 51 insertions(+), 23 deletions(-)

Comments

Brian Gerst July 19, 2017, 3:08 a.m. UTC | #1
On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <thgarnie@google.com> wrote:
> Percpu uses a clever design where the .percpu ELF section has a virtual
> address of zero and the relocation code avoids relocating specific
> symbols. It makes the code simple and easily adaptable with or without
> SMP support.
>
> This design is incompatible with PIE because generated code always tries to
> access the zero virtual address relative to the default mapping address.
> It becomes impossible when KASLR is configured to go below -2G. This
> patch solves this problem by removing the zero mapping and adapting the GS
> base to be relative to the expected address. These changes are done only
> when PIE is enabled. The original implementation is kept as-is
> by default.

The reason the per-cpu section is zero-based on x86-64 is to
work around GCC hardcoding the stack protector canary at %gs:40.  So
this patch is incompatible with CONFIG_STACK_PROTECTOR.
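
For context, the layout that makes %gs:40 work, lightly trimmed from
the arch/x86/include/asm/processor.h of this era:

/*
 * Because .data..percpu is zero-based and irq_stack_union is its first
 * object, the canary member below sits exactly at %gs:40, where GCC
 * hardcodes the stack protector load.
 */
union irq_stack_union {
        char irq_stack[IRQ_STACK_SIZE];
        struct {
                char gs_base[40];               /* pad to offset 40 */
                unsigned long stack_canary;     /* accessed as %gs:40 */
        };
};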

--
Brian Gerst
Thomas Garnier July 19, 2017, 6:26 p.m. UTC | #2
On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <brgerst@gmail.com> wrote:
> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <thgarnie@google.com> wrote:
>> Percpu uses a clever design where the .percpu ELF section has a virtual
>> address of zero and the relocation code avoids relocating specific
>> symbols. It makes the code simple and easily adaptable with or without
>> SMP support.
>>
>> This design is incompatible with PIE because generated code always tries to
>> access the zero virtual address relative to the default mapping address.
>> It becomes impossible when KASLR is configured to go below -2G. This
>> patch solves this problem by removing the zero mapping and adapting the GS
>> base to be relative to the expected address. These changes are done only
>> when PIE is enabled. The original implementation is kept as-is
>> by default.
>
> The reason the per-cpu section is zero-based on x86-64 is to
> work around GCC hardcoding the stack protector canary at %gs:40.  So
> this patch is incompatible with CONFIG_STACK_PROTECTOR.

Ok, that makes sense. I don't want this feature to not work with
CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
entry for gs so gs:40 points to the correct memory address and
gs:[rip+XX] works correctly through the MSR. Given the separate
discussion on mcmodel, I am first going to check whether we can move from
PIE to PIC with mcmodel=small or medium, which would remove the percpu
change requirement. I tried before without success, but I understand
percpu and other components better now, so maybe I can make it work.
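
For reference, the x86-64 code models being weighed here, summarized
from the gcc manual:

/*
 * -mcmodel=small   code and symbols linked in the low 2GB; the
 *                  userspace default.
 * -mcmodel=medium  code as in small, but large data objects placed
 *                  above 2GB and reached via 64-bit absolute
 *                  addressing.
 * -mcmodel=kernel  code and data in the negative 2GB (-2G..0); what
 *                  the kernel uses today and what PIE would no longer
 *                  require.
 */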

Thanks a lot for the feedback.

>
> --
> Brian Gerst
H. Peter Anvin July 19, 2017, 11:33 p.m. UTC | #3
On 07/19/17 11:26, Thomas Garnier wrote:
> On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <brgerst@gmail.com> wrote:
>> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <thgarnie@google.com> wrote:
>>> Percpu uses a clever design where the .percpu ELF section has a virtual
>>> address of zero and the relocation code avoids relocating specific
>>> symbols. It makes the code simple and easily adaptable with or without
>>> SMP support.
>>>
>>> This design is incompatible with PIE because generated code always tries to
>>> access the zero virtual address relative to the default mapping address.
>>> It becomes impossible when KASLR is configured to go below -2G. This
>>> patch solves this problem by removing the zero mapping and adapting the GS
>>> base to be relative to the expected address. These changes are done only
>>> when PIE is enabled. The original implementation is kept as-is
>>> by default.
>>
>> The reason the per-cpu section is zero-based on x86-64 is to
>> work around GCC hardcoding the stack protector canary at %gs:40.  So
>> this patch is incompatible with CONFIG_STACK_PROTECTOR.
> 
>> Ok, that makes sense. I don't want this feature to not work with
> CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
> entry for gs so gs:40 points to the correct memory address and
> gs:[rip+XX] works correctly through the MSR.

What are you talking about?  A GDT entry and the MSR do the same thing,
except that a GDT entry is limited to an offset of 0-0xffffffff (which
doesn't work for us, obviously.)
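
To make that limit concrete: a legacy descriptor splits its base across
three fields totalling 32 bits, while MSR_GS_BASE takes a full 64-bit
address. A sketch of the layout (field names are ours; the kernel's
real type is struct desc_struct):

#include <stdint.h>

struct gdt_descriptor {
        uint16_t limit_low;             /* limit 15:0 */
        uint16_t base_low;              /* base 15:0 */
        uint8_t  base_mid;              /* base 23:16 */
        uint8_t  access;                /* type, S, DPL, P */
        uint8_t  limit_high_flags;      /* limit 19:16, AVL, L, D/B, G */
        uint8_t  base_high;             /* base 31:24 - 32 bits total */
} __attribute__((packed));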

> Given the separate
>> discussion on mcmodel, I am first going to check whether we can move from
>> PIE to PIC with mcmodel=small or medium, which would remove the percpu
>> change requirement. I tried before without success, but I understand
>> percpu and other components better now, so maybe I can make it work.

>> This is silly.  The right thing for PIE is to be explicitly absolute,
>> without (%rip).  The use of (%rip) memory references for percpu is just
>> an optimization.
> 
> I agree that it is odd but that's how the compiler generates code. I
> will re-explore PIC options with mcmodel=small or medium, as mentioned
> on other threads.

Why should the way the compiler generates code affect the way we do things
in assembly?

That being said, the compiler now has support for generating this kind
of code explicitly via the __seg_gs pointer modifier.  That should let
us drop the __percpu_prefix and just use variables directly.  I suspect
we want to declare percpu variables as "volatile __seg_gs" to account
for the possibility of CPU switches.

Older compilers won't be able to work with this, of course, but I think
that it is acceptable for those older compilers to not be able to
support PIE.
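
A minimal sketch of that, assuming a compiler that defines __SEG_GS
(GCC 6 and later); the variable and function names are hypothetical,
not kernel API:

#ifdef __SEG_GS
/* Pointer form: dereferences go through the GS base. */
static unsigned long read_slot(const __seg_gs unsigned long *slot)
{
        return *slot;                   /* mov %gs:(%rdi), %rax */
}

/*
 * Symbol form as suggested above; see the follow-up messages below on
 * gcc currently generating poor code for this (PR 81490).
 */
extern volatile __seg_gs unsigned long demo_counter;

static unsigned long read_counter(void)
{
        return demo_counter;
}
#endif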

	-hpa
H. Peter Anvin July 20, 2017, 2:21 a.m. UTC | #4
On 07/19/17 16:33, H. Peter Anvin wrote:
>>
>> I agree that it is odd but that's how the compiler generates code. I
>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>> on other threads.
> 
> Why should the way the compiler generates code affect the way we do things
> in assembly?
> 
> That being said, the compiler now has support for generating this kind
> of code explicitly via the __seg_gs pointer modifier.  That should let
> us drop the __percpu_prefix and just use variables directly.  I suspect
> we want to declare percpu variables as "volatile __seg_gs" to account
> for the possibility of CPU switches.
> 
> Older compilers won't be able to work with this, of course, but I think
> that it is acceptable for those older compilers to not be able to
> support PIE.
> 

Grump.  It turns out that the compiler doesn't do the right thing for
symbols marked with the __seg_[fg]s markers.  __thread does the right
thing, but __thread a) still has %fs: hard-coded, and b) I believe it can
still cache %seg:0 arbitrarily long.

	-hpa
H. Peter Anvin July 20, 2017, 3:03 a.m. UTC | #5
On 07/19/17 19:21, H. Peter Anvin wrote:
> On 07/19/17 16:33, H. Peter Anvin wrote:
>>>
>>> I agree that it is odd but that's how the compiler generates code. I
>>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>>> on other threads.
>>
>> Why should the way the compiler generates code affect the way we do things
>> in assembly?
>>
>> That being said, the compiler now has support for generating this kind
>> of code explicitly via the __seg_gs pointer modifier.  That should let
>> us drop the __percpu_prefix and just use variables directly.  I suspect
>> we want to declare percpu variables as "volatile __seg_gs" to account
>> for the possibility of CPU switches.
>>
>> Older compilers won't be able to work with this, of course, but I think
>> that it is acceptable for those older compilers to not be able to
>> support PIE.
>>
> 
> Grump.  It turns out that the compiler doesn't do the right thing for
> symbols marked with the __seg_[fg]s markers.  __thread does the right
> thing, but __thread a) still has %fs: hard-coded, and b) I believe it can
> still cache %seg:0 arbitrarily long.

I filed this bug report for gcc:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81490

It might still be possible to work around this by playing really ugly
games with __thread, but I haven't yet figured out how best to do that.

	-hpa
Thomas Garnier July 20, 2017, 2:26 p.m. UTC | #6
On Wed, Jul 19, 2017 at 4:33 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 07/19/17 11:26, Thomas Garnier wrote:
>> On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <brgerst@gmail.com> wrote:
>>> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <thgarnie@google.com> wrote:
>>>> Percpu uses a clever design where the .percpu ELF section has a virtual
>>>> address of zero and the relocation code avoids relocating specific
>>>> symbols. It makes the code simple and easily adaptable with or without
>>>> SMP support.
>>>>
>>>> This design is incompatible with PIE because generated code always tries to
>>>> access the zero virtual address relative to the default mapping address.
>>>> It becomes impossible when KASLR is configured to go below -2G. This
>>>> patch solves this problem by removing the zero mapping and adapting the GS
>>>> base to be relative to the expected address. These changes are done only
>>>> when PIE is enabled. The original implementation is kept as-is
>>>> by default.
>>>
>>> The reason the per-cpu section is zero-based on x86-64 is to
>>> work around GCC hardcoding the stack protector canary at %gs:40.  So
>>> this patch is incompatible with CONFIG_STACK_PROTECTOR.
>>
>> Ok, that makes sense. I don't want this feature to not work with
>> CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
>> entry for gs so gs:40 points to the correct memory address and
>> gs:[rip+XX] works correctly through the MSR.
>
> What are you talking about?  A GDT entry and the MSR do the same thing,
> except that a GDT entry is limited to an offset of 0-0xffffffff (which
> doesn't work for us, obviously.)
>

A GDT entry would allow gs:0x40 to be valid while all gs:[rip+XX]
addresses use the MSR.

I didn't test it, but that was used in the RFG mitigation [1]. The fs
segment register was used for both thread storage and shadow stack.

[1] http://xlab.tencent.com/en/2016/11/02/return-flow-guard/

>> Given the separate
>> discussion on mcmodel, I am first going to check whether we can move from
>> PIE to PIC with mcmodel=small or medium, which would remove the percpu
>> change requirement. I tried before without success, but I understand
>> percpu and other components better now, so maybe I can make it work.
>
>>> This is silly.  The right thing for PIE is to be explicitly absolute,
>>> without (%rip).  The use of (%rip) memory references for percpu is just
>>> an optimization.
>>
>> I agree that it is odd but that's how the compiler generates code. I
>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>> on other threads.
>
> Why should the way the compiler generates code affect the way we do things
> in assembly?
>
> That being said, the compiler now has support for generating this kind
> of code explicitly via the __seg_gs pointer modifier.  That should let
> us drop the __percpu_prefix and just use variables directly.  I suspect
> we want to declare percpu variables as "volatile __seg_gs" to account
> for the possibility of CPU switches.
>
> Older compilers won't be able to work with this, of course, but I think
> that it is acceptable for those older compilers to not be able to
> support PIE.
>
>         -hpa
>
Thomas Garnier Aug. 2, 2017, 4:42 p.m. UTC | #7
On Thu, Jul 20, 2017 at 7:26 AM, Thomas Garnier <thgarnie@google.com> wrote:
> On Wed, Jul 19, 2017 at 4:33 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 07/19/17 11:26, Thomas Garnier wrote:
>>> On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <brgerst@gmail.com> wrote:
>>>> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <thgarnie@google.com> wrote:
>>>>> Percpu uses a clever design where the .percpu ELF section has a virtual
>>>>> address of zero and the relocation code avoids relocating specific
>>>>> symbols. It makes the code simple and easily adaptable with or without
>>>>> SMP support.
>>>>>
>>>>> This design is incompatible with PIE because generated code always tries to
>>>>> access the zero virtual address relative to the default mapping address.
>>>>> It becomes impossible when KASLR is configured to go below -2G. This
>>>>> patch solves this problem by removing the zero mapping and adapting the GS
>>>>> base to be relative to the expected address. These changes are done only
>>>>> when PIE is enabled. The original implementation is kept as-is
>>>>> by default.
>>>>
>>>> The reason the per-cpu section is zero-based on x86-64 is to
>>>> work around GCC hardcoding the stack protector canary at %gs:40.  So
>>>> this patch is incompatible with CONFIG_STACK_PROTECTOR.
>>>
>>> Ok, that makes sense. I don't want this feature to not work with
>>> CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
>>> entry for gs so gs:40 points to the correct memory address and
>>> gs:[rip+XX] works correctly through the MSR.
>>
>> What are you talking about?  A GDT entry and the MSR do the same thing,
>> except that a GDT entry is limited to an offset of 0-0xffffffff (which
>> doesn't work for us, obviously.)
>>
>
> A GDT entry would allow gs:0x40 to be valid while all gs:[rip+XX]
> addresses use the MSR.
>
> I didn't test it, but that was used in the RFG mitigation [1]. The fs
> segment register was used for both thread storage and shadow stack.
>
> [1] http://xlab.tencent.com/en/2016/11/02/return-flow-guard/
>

Small update on that.

I noticed that not only do we have the problem of gs:0x40 not being
accessible; the compiler will also default to the fs register if
mcmodel=kernel is not set.

On the next patch set, I am going to add support for
-mstack-protector-guard=global so a global variable can be used
instead of the segment register, a similar approach to ARM/ARM64.
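
Roughly what the global-guard mode amounts to (a sketch; the symbol
name is the one ARM/ARM64 kernels already export, the helper is
hypothetical):

/*
 * One shared canary for every task and CPU, instead of a per-task
 * value read from %gs:40 - the weakening Kees points out below.
 */
unsigned long __stack_chk_guard;

/* Hypothetical seeding helper; a real kernel would draw this from a
 * boot-time entropy source. */
static void seed_stack_guard(unsigned long entropy)
{
        __stack_chk_guard = entropy;
}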

Following this patch, I will work with gcc and llvm to add
-mstack-protector-reg=<segment register> support similar to PowerPC.
This way we can have gs used even without mcmodel=kernel. Once that's
an option, I can set up the GDT as described in the previous email
(similar to RFG).

Let me know what you think about this approach.

>>> Given the separate
>>> discussion on mcmodel, I am first going to check whether we can move from
>>> PIE to PIC with mcmodel=small or medium, which would remove the percpu
>>> change requirement. I tried before without success, but I understand
>>> percpu and other components better now, so maybe I can make it work.
>>
>>>> This is silly.  The right thing for PIE is to be explicitly absolute,
>>>> without (%rip).  The use of (%rip) memory references for percpu is just
>>>> an optimization.
>>>
>>> I agree that it is odd but that's how the compiler generates code. I
>>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>>> on other threads.
>>
>> Why should the way the compiler generates code affect the way we do things
>> in assembly?
>>
>> That being said, the compiler now has support for generating this kind
>> of code explicitly via the __seg_gs pointer modifier.  That should let
>> us drop the __percpu_prefix and just use variables directly.  I suspect
>> we want to declare percpu variables as "volatile __seg_gs" to account
>> for the possibility of CPU switches.
>>
>> Older compilers won't be able to work with this, of course, but I think
>> that it is acceptable for those older compilers to not be able to
>> support PIE.
>>
>>         -hpa
>>
>
>
>
> --
> Thomas
Kees Cook Aug. 2, 2017, 4:56 p.m. UTC | #8
On Wed, Aug 2, 2017 at 9:42 AM, Thomas Garnier <thgarnie@google.com> wrote:
> I noticed that not only do we have the problem of gs:0x40 not being
> accessible; the compiler will also default to the fs register if
> mcmodel=kernel is not set.
>
> On the next patch set, I am going to add support for
> -mstack-protector-guard=global so a global variable can be used
> instead of the segment register, a similar approach to ARM/ARM64.

While this is probably understood, I have to point out that this would
be a major regression for the stack protection on x86.

> Following this patch, I will work with gcc and llvm to add
> -mstack-protector-reg=<segment register> support similar to PowerPC.
> This way we can have gs used even without mcmodel=kernel. Once that's
> an option, I can set up the GDT as described in the previous email
> (similar to RFG).

It would be much nicer if we could teach gcc about the percpu area
instead. This would let us solve the global stack protector problem on
the other architectures:
http://www.openwall.com/lists/kernel-hardening/2017/06/27/6

-Kees
Thomas Garnier Aug. 2, 2017, 6:05 p.m. UTC | #9
On Wed, Aug 2, 2017 at 9:56 AM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Aug 2, 2017 at 9:42 AM, Thomas Garnier <thgarnie@google.com> wrote:
>> I noticed that not only do we have the problem of gs:0x40 not being
>> accessible; the compiler will also default to the fs register if
>> mcmodel=kernel is not set.
>>
>> On the next patch set, I am going to add support for
>> -mstack-protector-guard=global so a global variable can be used
>> instead of the segment register, a similar approach to ARM/ARM64.
>
> While this is probably understood, I have to point out that this would
> be a major regression for the stack protection on x86.

I agree; the optimal solution will be using updated gcc/clang.

>
>> Following this patch, I will work with gcc and llvm to add
>> -mstack-protector-reg=<segment register> support similar to PowerPC.
>> This way we can have gs used even without mcmodel=kernel. Once that's
>> an option, I can set up the GDT as described in the previous email
>> (similar to RFG).
>
> It would be much nicer if we could teach gcc about the percpu area
> instead. This would let us solve the global stack protector problem on
> the other architectures:
> http://www.openwall.com/lists/kernel-hardening/2017/06/27/6

Yes, while I am looking at gcc I will take a look at other
architectures to see if I can help there too.

>
> -Kees
>
> --
> Kees Cook
> Pixel Security

Patch

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 691c4755269b..be198c0a2a8c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -388,7 +388,7 @@  ENTRY(__switch_to_asm)
 
 #ifdef CONFIG_CC_STACKPROTECTOR
 	movq	TASK_stack_canary(%rsi), %rbx
-	movq	%rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
+	movq	%rbx, PER_CPU_VAR(irq_stack_union + stack_canary_offset)
 #endif
 
 	/* restore callee-saved registers */
@@ -739,7 +739,7 @@  apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 /*
  * Exception entry points.
  */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
+#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss + (TSS_ist + ((x) - 1) * 8))
 
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 9fa03604b2b3..862eb771f0e5 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -4,9 +4,11 @@ 
 #ifdef CONFIG_X86_64
 #define __percpu_seg		gs
 #define __percpu_mov_op		movq
+#define __percpu_rel		(%rip)
 #else
 #define __percpu_seg		fs
 #define __percpu_mov_op		movl
+#define __percpu_rel
 #endif
 
 #ifdef __ASSEMBLY__
@@ -27,10 +29,14 @@ 
 #define PER_CPU(var, reg)						\
 	__percpu_mov_op %__percpu_seg:this_cpu_off, reg;		\
 	lea var(reg), reg
-#define PER_CPU_VAR(var)	%__percpu_seg:var
+/* Compatible with Position Independent Code */
+#define PER_CPU_VAR(var)		%__percpu_seg:(var)##__percpu_rel
+/* Rare absolute reference */
+#define PER_CPU_VAR_ABS(var)		%__percpu_seg:var
 #else /* ! SMP */
 #define PER_CPU(var, reg)	__percpu_mov_op $var, reg
-#define PER_CPU_VAR(var)	var
+#define PER_CPU_VAR(var)	(var)##__percpu_rel
+#define PER_CPU_VAR_ABS(var)	var
 #endif	/* SMP */
 
 #ifdef CONFIG_X86_64_SMP
@@ -208,27 +214,34 @@  do {									\
 	pfo_ret__;					\
 })
 
+/* Position Independent code uses relative addresses only */
+#ifdef CONFIG_X86_PIE
+#define __percpu_stable_arg __percpu_arg(a1)
+#else
+#define __percpu_stable_arg __percpu_arg(P1)
+#endif
+
 #define percpu_stable_op(op, var)			\
 ({							\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b "__percpu_arg(P1)",%0"	\
+		asm(op "b "__percpu_stable_arg ",%0"	\
 		    : "=q" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 2:						\
-		asm(op "w "__percpu_arg(P1)",%0"	\
+		asm(op "w "__percpu_stable_arg ",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 4:						\
-		asm(op "l "__percpu_arg(P1)",%0"	\
+		asm(op "l "__percpu_stable_arg ",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
 	case 8:						\
-		asm(op "q "__percpu_arg(P1)",%0"	\
+		asm(op "q "__percpu_stable_arg ",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "p" (&(var)));			\
 		break;					\
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b95cd94ca97b..31300767ec0f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -480,7 +480,9 @@  void load_percpu_segment(int cpu)
 	loadsegment(fs, __KERNEL_PERCPU);
 #else
 	__loadsegment_simple(gs, 0);
-	wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
+	wrmsrl(MSR_GS_BASE,
+	       (unsigned long)per_cpu(irq_stack_union.gs_base, cpu) -
+	       (unsigned long)__per_cpu_start);
 #endif
 	load_stack_canary_segment();
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 7e4f7a83a15a..4d0a7e68bfe8 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -256,7 +256,11 @@  ENDPROC(start_cpu0)
 	GLOBAL(initial_code)
 	.quad	x86_64_start_kernel
 	GLOBAL(initial_gs)
+#ifdef CONFIG_X86_PIE
+	.quad	0
+#else
 	.quad	INIT_PER_CPU_VAR(irq_stack_union)
+#endif
 	GLOBAL(initial_stack)
 	/*
 	 * The SIZEOF_PTREGS gap is a convention which helps the in-kernel
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 10edd1e69a68..ce1c58a29def 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -25,7 +25,7 @@ 
 DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
 EXPORT_PER_CPU_SYMBOL(cpu_number);
 
-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) && !defined(CONFIG_X86_PIE)
 #define BOOT_PERCPU_OFFSET ((unsigned long)__per_cpu_load)
 #else
 #define BOOT_PERCPU_OFFSET 0
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index c8a3b61be0aa..77f1b0622539 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -183,9 +183,14 @@  SECTIONS
 	/*
 	 * percpu offsets are zero-based on SMP.  PERCPU_VADDR() changes the
 	 * output PHDR, so the next output section - .init.text - should
-	 * start another segment - init.
+	 * start another segment - init. For Position Independent Code, the
+	 * per-cpu section cannot be zero-based because everything is relative.
 	 */
+#ifdef CONFIG_X86_PIE
+	PERCPU_SECTION(INTERNODE_CACHE_BYTES)
+#else
 	PERCPU_VADDR(INTERNODE_CACHE_BYTES, 0, :percpu)
+#endif
 	ASSERT(SIZEOF(.data..percpu) < CONFIG_PHYSICAL_START,
 	       "per-CPU data too large - increase CONFIG_PHYSICAL_START")
 #endif
@@ -361,7 +366,11 @@  SECTIONS
  * Per-cpu symbols which need to be offset from __per_cpu_load
  * for the boot processor.
  */
+#ifdef CONFIG_X86_PIE
+#define INIT_PER_CPU(x) init_per_cpu__##x = x
+#else
 #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
+#endif
 INIT_PER_CPU(gdt_page);
 INIT_PER_CPU(irq_stack_union);
 
@@ -371,7 +380,7 @@  INIT_PER_CPU(irq_stack_union);
 . = ASSERT((_end - _text <= KERNEL_IMAGE_SIZE),
 	   "kernel image bigger than KERNEL_IMAGE_SIZE");
 
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && !defined(CONFIG_X86_PIE)
 . = ASSERT((irq_stack_union == 0),
            "irq_stack_union is not at start of per-cpu area");
 #endif
diff --git a/arch/x86/lib/cmpxchg16b_emu.S b/arch/x86/lib/cmpxchg16b_emu.S
index 9b330242e740..254950604ae4 100644
--- a/arch/x86/lib/cmpxchg16b_emu.S
+++ b/arch/x86/lib/cmpxchg16b_emu.S
@@ -33,13 +33,13 @@  ENTRY(this_cpu_cmpxchg16b_emu)
 	pushfq
 	cli
 
-	cmpq PER_CPU_VAR((%rsi)), %rax
+	cmpq PER_CPU_VAR_ABS((%rsi)), %rax
 	jne .Lnot_same
-	cmpq PER_CPU_VAR(8(%rsi)), %rdx
+	cmpq PER_CPU_VAR_ABS(8(%rsi)), %rdx
 	jne .Lnot_same
 
-	movq %rbx, PER_CPU_VAR((%rsi))
-	movq %rcx, PER_CPU_VAR(8(%rsi))
+	movq %rbx, PER_CPU_VAR_ABS((%rsi))
+	movq %rcx, PER_CPU_VAR_ABS(8(%rsi))
 
 	popfq
 	mov $1, %al
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index eff224df813f..40410969fd3c 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -26,7 +26,7 @@ 
 ENTRY(xen_irq_enable_direct)
 	FRAME_BEGIN
 	/* Unmask events */
-	movb $0, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	movb $0, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 
 	/*
 	 * Preempt here doesn't matter because that will deal with any
@@ -35,7 +35,7 @@  ENTRY(xen_irq_enable_direct)
 	 */
 
 	/* Test for pending */
-	testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
+	testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_pending)
 	jz 1f
 
 2:	call check_events
@@ -52,7 +52,7 @@  ENDPATCH(xen_irq_enable_direct)
  * non-zero.
  */
 ENTRY(xen_irq_disable_direct)
-	movb $1, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	movb $1, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 ENDPATCH(xen_irq_disable_direct)
 	ret
 	ENDPROC(xen_irq_disable_direct)
@@ -68,7 +68,7 @@  ENDPATCH(xen_irq_disable_direct)
  * x86 use opposite senses (mask vs enable).
  */
 ENTRY(xen_save_fl_direct)
-	testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	setz %ah
 	addb %ah, %ah
 ENDPATCH(xen_save_fl_direct)
@@ -91,7 +91,7 @@  ENTRY(xen_restore_fl_direct)
 #else
 	testb $X86_EFLAGS_IF>>8, %ah
 #endif
-	setz PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+	setz PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
 	/*
 	 * Preempt here doesn't matter because that will deal with any
 	 * pending interrupts.  The pending check may end up being run
@@ -99,7 +99,7 @@  ENTRY(xen_restore_fl_direct)
 	 */
 
 	/* check for unmasked and pending */
-	cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
+	cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_pending)
 	jnz 1f
 2:	call check_events
 1:
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..4fb5d6fc2c4f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1201,7 +1201,7 @@  config KALLSYMS_ALL
 config KALLSYMS_ABSOLUTE_PERCPU
 	bool
 	depends on KALLSYMS
-	default X86_64 && SMP
+	default X86_64 && SMP && !X86_PIE
 
 config KALLSYMS_BASE_RELATIVE
 	bool