
[RFC] x86/mm/KASLR: Remap GDTs at fixed location

Message ID 20170104221630.831-1-thgarnie@google.com
State New, archived

Commit Message

Thomas Garnier Jan. 4, 2017, 10:16 p.m. UTC
Each processor holds a GDT in its per-cpu structure. The sgdt
instruction gives the base address of the current GDT. This address can
be used to bypass KASLR memory randomization. Combined with another bug,
an attacker could target other per-cpu structures or deduce the base of
the main memory section (PAGE_OFFSET).

In this change, a space is reserved at the end of the memory range
available for KASLR memory randomization. The space is big enough to hold
the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
mapped at a specific offset based on the target CPU. Note that if there
is not enough space available, the GDTs are not remapped.

The documentation was updated to mention GDT remapping for KASLR. This
patch also includes page-table dump support.

This patch was tested on multiple hardware configurations and with
hibernation support.
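The reserved range and per-CPU offsets described above come down to simple address arithmetic. A sketch for illustration only: the names and constants below are hypothetical, while the real base is derived from the KASLR region layout in arch/x86/mm/kaslr.c.

```c
#include <stdint.h>

/* Hypothetical constants for illustration; the real patch computes the
 * base from the end of the KASLR memory-randomization range. */
#define GDT_REMAP_BASE 0xffffff0000000000ULL /* hypothetical base address */
#define GDT_STRIDE     4096ULL               /* one page per CPU's GDT   */

/* Each CPU's GDT lands at a fixed, predictable offset from the base, so
 * SGDT leaks only this alias, never the randomized per-cpu area. */
static uint64_t gdt_remap_address(int cpu)
{
	return GDT_REMAP_BASE + (uint64_t)cpu * GDT_STRIDE;
}
```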

Signed-off-by: Thomas Garnier <thgarnie@google.com>
---
Based on next-20170104
---
 arch/x86/include/asm/kaslr.h  |   4 ++
 arch/x86/kernel/cpu/common.c  |   7 ++-
 arch/x86/mm/dump_pagetables.c |  10 ++++
 arch/x86/mm/kaslr.c           | 107 +++++++++++++++++++++++++++++++++++++++++-
 kernel/cpu.c                  |   3 ++
 kernel/smp.c                  |   1 +
 6 files changed, 130 insertions(+), 2 deletions(-)

Comments

Ingo Molnar Jan. 5, 2017, 8:11 a.m. UTC | #1
* Thomas Garnier <thgarnie@google.com> wrote:

> Each processor holds a GDT in its per-cpu structure. The sgdt
> instruction gives the base address of the current GDT. This address can
> be used to bypass KASLR memory randomization. With another bug, an
> attacker could target other per-cpu structures or deduce the base of the
> main memory section (PAGE_OFFSET).
> 
> In this change, a space is reserved at the end of the memory range
> available for KASLR memory randomization. The space is big enough to hold
> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
> mapped at specific offset based on the target CPU. Note that if there is
> not enough space available, the GDTs are not remapped.
> 
> The document was changed to mention GDT remapping for KASLR. This patch
> also include dump page tables support.
> 
> This patch was tested on multiple hardware configurations and for
> hibernation support.

>  void kernel_randomize_memory(void);
> +void kernel_randomize_smp(void);
> +void* kaslr_get_gdt_remap(int cpu);

Yeah, no fundamental objections from me to the principle, but I get some bad vibes 
from the naming here: seeing that kernel_randomize_smp() actually makes things 
less random.

Also, don't we want to do this unconditionally and not allow remapping failures? 

The GDT is fairly small, plus making the SGDT instruction expose fewer kernel 
internals would be (marginally) useful on non-randomized kernels as well.

It also makes the code more common, more predictable, more debuggable and less 
complex overall - which is pretty valuable in terms of long term security as well.

Thanks,

	Ingo
Arjan van de Ven Jan. 5, 2017, 3:08 p.m. UTC | #2
On 1/5/2017 12:11 AM, Ingo Molnar wrote:
>
> * Thomas Garnier <thgarnie@google.com> wrote:
>
>> Each processor holds a GDT in its per-cpu structure. The sgdt
>> instruction gives the base address of the current GDT. This address can
>> be used to bypass KASLR memory randomization. With another bug, an
>> attacker could target other per-cpu structures or deduce the base of the
>> main memory section (PAGE_OFFSET).
>>
>> In this change, a space is reserved at the end of the memory range
>> available for KASLR memory randomization. The space is big enough to hold
>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
>> mapped at specific offset based on the target CPU. Note that if there is
>> not enough space available, the GDTs are not remapped.
>>
>> The document was changed to mention GDT remapping for KASLR. This patch
>> also include dump page tables support.
>>
>> This patch was tested on multiple hardware configurations and for
>> hibernation support.
>
>>  void kernel_randomize_memory(void);
>> +void kernel_randomize_smp(void);
>> +void* kaslr_get_gdt_remap(int cpu);
>
> Yeah, no fundamental objections from me to the principle, but I get some bad vibes
> from the naming here: seeing that kernel_randomize_smp() actually makes things
> less random.
>

kernel_unrandomize_smp() ...

One request: can we make sure this unrandomization is optional?
Thomas Garnier Jan. 5, 2017, 4:39 p.m. UTC | #3
On Thu, Jan 5, 2017 at 12:11 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Garnier <thgarnie@google.com> wrote:
>
>> Each processor holds a GDT in its per-cpu structure. The sgdt
>> instruction gives the base address of the current GDT. This address can
>> be used to bypass KASLR memory randomization. With another bug, an
>> attacker could target other per-cpu structures or deduce the base of the
>> main memory section (PAGE_OFFSET).
>>
>> In this change, a space is reserved at the end of the memory range
>> available for KASLR memory randomization. The space is big enough to hold
>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
>> mapped at specific offset based on the target CPU. Note that if there is
>> not enough space available, the GDTs are not remapped.
>>
>> The document was changed to mention GDT remapping for KASLR. This patch
>> also include dump page tables support.
>>
>> This patch was tested on multiple hardware configurations and for
>> hibernation support.
>
>>  void kernel_randomize_memory(void);
>> +void kernel_randomize_smp(void);
>> +void* kaslr_get_gdt_remap(int cpu);
>
> Yeah, no fundamental objections from me to the principle, but I get some bad vibes
> from the naming here: seeing that kernel_randomize_smp() actually makes things
> less random.
>

I agree, I went back and forth on the name. I will change it to
something better.

> Also, don't we want to do this unconditionally and not allow remapping failures?
>
> The GDT is fairly small, plus making the SGDT instruction expose fewer kernel
> internals would be (marginally) useful on non-randomized kernels as well.
>
> It also makes the code more common, more predictable, more debuggable and less
> complex overall - which is pretty valuable in terms of long term security as well.
>

Okay, I will add a BUG_ON on remap failures.

> Thanks,
>
>         Ingo

Ingo: I saw the 5-level page table support being sent through. Do you
want me to wait for it to be in -next? (Given that it will need to be
changed too.)
Thomas Garnier Jan. 5, 2017, 4:40 p.m. UTC | #4
On Thu, Jan 5, 2017 at 7:08 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 1/5/2017 12:11 AM, Ingo Molnar wrote:
>>
>>
>> * Thomas Garnier <thgarnie@google.com> wrote:
>>
>>> Each processor holds a GDT in its per-cpu structure. The sgdt
>>> instruction gives the base address of the current GDT. This address can
>>> be used to bypass KASLR memory randomization. With another bug, an
>>> attacker could target other per-cpu structures or deduce the base of the
>>> main memory section (PAGE_OFFSET).
>>>
>>> In this change, a space is reserved at the end of the memory range
>>> available for KASLR memory randomization. The space is big enough to hold
>>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
>>> mapped at specific offset based on the target CPU. Note that if there is
>>> not enough space available, the GDTs are not remapped.
>>>
>>> The document was changed to mention GDT remapping for KASLR. This patch
>>> also include dump page tables support.
>>>
>>> This patch was tested on multiple hardware configurations and for
>>> hibernation support.
>>
>>
>>>  void kernel_randomize_memory(void);
>>> +void kernel_randomize_smp(void);
>>> +void* kaslr_get_gdt_remap(int cpu);
>>
>>
>> Yeah, no fundamental objections from me to the principle, but I get some
>> bad vibes
>> from the naming here: seeing that kernel_randomize_smp() actually makes
>> things
>> less random.
>>
>
> kernel_unrandomize_smp() ...
>

That seems like a better name.

> one request.. can we make sure this unrandomization is optional?
>

Well, it happens only when KASLR memory randomization is enabled. Do
you think it should have a separate config option?
Andy Lutomirski Jan. 5, 2017, 5:51 p.m. UTC | #5
On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote:
> Each processor holds a GDT in its per-cpu structure. The sgdt
> instruction gives the base address of the current GDT. This address can
> be used to bypass KASLR memory randomization. With another bug, an
> attacker could target other per-cpu structures or deduce the base of the
> main memory section (PAGE_OFFSET).
>
> In this change, a space is reserved at the end of the memory range
> available for KASLR memory randomization. The space is big enough to hold
> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
> mapped at specific offset based on the target CPU. Note that if there is
> not enough space available, the GDTs are not remapped.

Can we remap it read-only?  I.e. use PAGE_KERNEL_RO instead of
PAGE_KERNEL.  After all, the ability to modify the GDT is instant
root.
Thomas Garnier Jan. 5, 2017, 5:54 p.m. UTC | #6
On Thu, Jan 5, 2017 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote:
>> Each processor holds a GDT in its per-cpu structure. The sgdt
>> instruction gives the base address of the current GDT. This address can
>> be used to bypass KASLR memory randomization. With another bug, an
>> attacker could target other per-cpu structures or deduce the base of the
>> main memory section (PAGE_OFFSET).
>>
>> In this change, a space is reserved at the end of the memory range
>> available for KASLR memory randomization. The space is big enough to hold
>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
>> mapped at specific offset based on the target CPU. Note that if there is
>> not enough space available, the GDTs are not remapped.
>
> Can we remap it read-only?  I.e. use PAGE_KERNEL_RO instead of
> PAGE_KERNEL.  After all, the ability to modify the GDT is instant
> root.

That's my goal too. I started by doing a RO remap and got a couple of
problems with hibernation. I can try again for the next iteration or
delay it to another patch. I also need to look at KVM GDT usage; I am
not familiar with it yet.
Andy Lutomirski Jan. 5, 2017, 6:01 p.m. UTC | #7
On Thu, Jan 5, 2017 at 9:54 AM, Thomas Garnier <thgarnie@google.com> wrote:
> On Thu, Jan 5, 2017 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote:
>>> Each processor holds a GDT in its per-cpu structure. The sgdt
>>> instruction gives the base address of the current GDT. This address can
>>> be used to bypass KASLR memory randomization. With another bug, an
>>> attacker could target other per-cpu structures or deduce the base of the
>>> main memory section (PAGE_OFFSET).
>>>
>>> In this change, a space is reserved at the end of the memory range
>>> available for KASLR memory randomization. The space is big enough to hold
>>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
>>> mapped at specific offset based on the target CPU. Note that if there is
>>> not enough space available, the GDTs are not remapped.
>>
>> Can we remap it read-only?  I.e. use PAGE_KERNEL_RO instead of
>> PAGE_KERNEL.  After all, the ability to modify the GDT is instant
>> root.
>
> That's my goal too. I started by doing a RO remap and got couple
> problems with hibernation. I can try again for the next iteration or
> delay it for another patch. I also need to look at KVM GDT usage, I am
> not familiar with it yet.

If you want a small adventure, I think a significant KVM-related
performance improvement is available.  Specifically, on VMX exits, the
GDT limit is hardwired to 0xffff (IIRC -- I could be remembering the
actual value wrong).  KVM does LGDT to fix it.

If we actually made the GDT have limit 0xffff (presumably by mapping
the zero page a few times to pad it out without wasting memory), then
we would avoid the LGDT.  LGDT is incredibly slow, so this would be a
big win. Want to see if you can make this work with your patch set?
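Padding the GDT out to a given limit is plain page arithmetic; a rough user-space sketch of the sizing (helper name hypothetical), assuming a 4 KiB page size:

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Pages needed to back a GDT whose limit register is padded out to
 * `limit` (the limit is the last valid byte offset, hence the +1);
 * all pages past the first could alias the zero page so no real
 * memory is wasted. */
static unsigned int gdt_pages_for_limit(uint32_t limit)
{
	return (limit + 1 + PAGE_SIZE - 1) / PAGE_SIZE;
}
```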
Thomas Garnier Jan. 5, 2017, 6:35 p.m. UTC | #8
On Thu, Jan 5, 2017 at 10:01 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jan 5, 2017 at 9:54 AM, Thomas Garnier <thgarnie@google.com> wrote:
>> On Thu, Jan 5, 2017 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote:
>>>> Each processor holds a GDT in its per-cpu structure. The sgdt
>>>> instruction gives the base address of the current GDT. This address can
>>>> be used to bypass KASLR memory randomization. With another bug, an
>>>> attacker could target other per-cpu structures or deduce the base of the
>>>> main memory section (PAGE_OFFSET).
>>>>
>>>> In this change, a space is reserved at the end of the memory range
>>>> available for KASLR memory randomization. The space is big enough to hold
>>>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is
>>>> mapped at specific offset based on the target CPU. Note that if there is
>>>> not enough space available, the GDTs are not remapped.
>>>
>>> Can we remap it read-only?  I.e. use PAGE_KERNEL_RO instead of
>>> PAGE_KERNEL.  After all, the ability to modify the GDT is instant
>>> root.
>>
>> That's my goal too. I started by doing a RO remap and got couple
>> problems with hibernation. I can try again for the next iteration or
>> delay it for another patch. I also need to look at KVM GDT usage, I am
>> not familiar with it yet.
>
> If you want a small adventure, I think a significant KVM-related
> performance improvement is available.  Specifically, on VMX exits, the
> GDT limit is hardwired to 0xffff (IIRC -- I could be remembering the
> actual vaue wrong).  KVM does LGDT to fix it.
>
> If we actually made the GDT have limit 0xffff (presumably by mapping
> the zero page a few times to pad it out without wasting memory), then
> we would avoid the LGDT.  LGDT is incredibly slow, so this would be a
> big win. Want to see if you can make this work with your patch set?

I can always take a look. If you have any prototype or more details,
feel free to send it to me on a separate thread.
Arjan van de Ven Jan. 5, 2017, 6:56 p.m. UTC | #9
On 1/5/2017 8:40 AM, Thomas Garnier wrote:
> Well, it happens only when KASLR memory randomization is enabled. Do
> you think it should have a separate config option?

No, I would want it to be a runtime option... "sgdt from ring 3" is going
away with UMIP (and is already possibly gone in virtual machines, see
https://lwn.net/Articles/694385/), and for those cases it would be a shame
to lose the randomization.
Arjan van de Ven Jan. 5, 2017, 6:58 p.m. UTC | #10
On 1/5/2017 9:54 AM, Thomas Garnier wrote:

>
> That's my goal too. I started by doing a RO remap and got couple
> problems with hibernation. I can try again for the next iteration or
> delay it for another patch. I also need to look at KVM GDT usage, I am
> not familiar with it yet.

don't we write to the GDT as part of the TLS segment stuff for glibc?
Thomas Garnier Jan. 5, 2017, 7 p.m. UTC | #11
On Thu, Jan 5, 2017 at 10:56 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 1/5/2017 8:40 AM, Thomas Garnier wrote:
>>
>> Well, it happens only when KASLR memory randomization is enabled. Do
>> you think it should have a separate config option?
>
>
> no I would want it a runtime option.... "sgdt from ring 3" is going away
> with UMIP (and is already possibly gone in virtual machines, see
> https://lwn.net/Articles/694385/) and for those cases it would be a shame
> to lose the randomization
>

That's correct. When UMIP is enabled, we should disable the fixed location
for both the GDT and IDT. Glad to do that when UMIP support is added.
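The runtime gate discussed above could be as small as a boolean predicate. A minimal sketch, with entirely hypothetical names (UMIP detection was not yet in the kernel at the time):

```c
#include <stdbool.h>

/* Hypothetical runtime gate: when UMIP already hides SGDT from ring 3,
 * skip the fixed-location remap and keep the randomized GDT address. */
static bool want_fixed_gdt_remap(bool kaslr_mem_enabled, bool cpu_has_umip)
{
	return kaslr_mem_enabled && !cpu_has_umip;
}
```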
Thomas Garnier Jan. 5, 2017, 7:03 p.m. UTC | #12
On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 1/5/2017 9:54 AM, Thomas Garnier wrote:
>
>>
>> That's my goal too. I started by doing a RO remap and got couple
>> problems with hibernation. I can try again for the next iteration or
>> delay it for another patch. I also need to look at KVM GDT usage, I am
>> not familiar with it yet.
>
>
> don't we write to the GDT as part of the TLS segment stuff for glibc ?
>

Not sure which glibc feature that is.

In this design, you can still write to the GDT per-cpu variable, which
remains read-write. You just need to make the remapping writeable when
we load the task register (ltr), because the processor then uses the
current GDT address. At least that's the case I know of; I might find
more through testing.
Andy Lutomirski Jan. 5, 2017, 8:18 p.m. UTC | #13
On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote:
> On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> On 1/5/2017 9:54 AM, Thomas Garnier wrote:
>>
>>>
>>> That's my goal too. I started by doing a RO remap and got couple
>>> problems with hibernation. I can try again for the next iteration or
>>> delay it for another patch. I also need to look at KVM GDT usage, I am
>>> not familiar with it yet.
>>
>>
>> don't we write to the GDT as part of the TLS segment stuff for glibc ?
>>
>
> Not sure which glibc feature it is.
>
> In this design, you can write to the GDT per-cpu variable that will
> remain read-write. You just need to make the remapping writeable when
> we load task registers (ltr) then the processor use the current GDT
> address. At least that the case I know, I might find more through
> testing.

Hmm.  I bet that if we preset the accessed bits in all the segments
then we don't need it to be writable in general.  But your point about
set_thread_area (TLS) is well taken.  However, I strongly suspect that
we could make set_thread_area unconditionally set the accessed bit and
no one would ever notice.
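The accessed-bit idea is plain descriptor bit manipulation. A hedged sketch (helper name hypothetical; bit position per the x86 segment descriptor layout, where bit 40 of the 8-byte descriptor is the accessed bit for code/data segments):

```c
#include <stdint.h>

/* Presetting the accessed bit means the CPU never has to write the GDT
 * entry through the GDTR alias, so a read-only alias stays fault-free. */
#define DESC_ACCESSED_BIT (1ULL << 40)

static uint64_t desc_preset_accessed(uint64_t desc)
{
	return desc | DESC_ACCESSED_BIT;
}
```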
Thomas Garnier Jan. 5, 2017, 9:08 p.m. UTC | #14
On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote:
>> On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>> On 1/5/2017 9:54 AM, Thomas Garnier wrote:
>>>
>>>>
>>>> That's my goal too. I started by doing a RO remap and got couple
>>>> problems with hibernation. I can try again for the next iteration or
>>>> delay it for another patch. I also need to look at KVM GDT usage, I am
>>>> not familiar with it yet.
>>>
>>>
>>> don't we write to the GDT as part of the TLS segment stuff for glibc ?
>>>
>>
>> Not sure which glibc feature it is.
>>
>> In this design, you can write to the GDT per-cpu variable that will
>> remain read-write. You just need to make the remapping writeable when
>> we load task registers (ltr) then the processor use the current GDT
>> address. At least that the case I know, I might find more through
>> testing.
>
> Hmm.  I bet that if we preset the accessed bits in all the segments
> then we don't need it to be writable in general.  But your point about
> set_thread_area (TLS) is well taken.  However, I strongly suspect that
> we could make set_thread_area unconditionally set the accessed bit and
> no one would ever notice.

Not sure I fully understood, and I don't want to miss an important
point. Do you mean making the GDT (remapping and per-cpu) read-only and
switching the writeable flag only when we write to the per-cpu entry?
Andy Lutomirski Jan. 5, 2017, 9:19 p.m. UTC | #15
On Thu, Jan 5, 2017 at 1:08 PM, Thomas Garnier <thgarnie@google.com> wrote:
> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote:
>>> On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>>> On 1/5/2017 9:54 AM, Thomas Garnier wrote:
>>>>
>>>>>
>>>>> That's my goal too. I started by doing a RO remap and got couple
>>>>> problems with hibernation. I can try again for the next iteration or
>>>>> delay it for another patch. I also need to look at KVM GDT usage, I am
>>>>> not familiar with it yet.
>>>>
>>>>
>>>> don't we write to the GDT as part of the TLS segment stuff for glibc ?
>>>>
>>>
>>> Not sure which glibc feature it is.
>>>
>>> In this design, you can write to the GDT per-cpu variable that will
>>> remain read-write. You just need to make the remapping writeable when
>>> we load task registers (ltr) then the processor use the current GDT
>>> address. At least that the case I know, I might find more through
>>> testing.
>>
>> Hmm.  I bet that if we preset the accessed bits in all the segments
>> then we don't need it to be writable in general.  But your point about
>> set_thread_area (TLS) is well taken.  However, I strongly suspect that
>> we could make set_thread_area unconditionally set the accessed bit and
>> no one would ever notice.
>
> Not sure I fully understood and I don't want to miss an important
> point. Do you mean making GDT (remapping and per-cpu) read-only and
> switch the writeable flag only when we write to the per-cpu entry?
>

What I mean is: write to the GDT through normal percpu access (or
whatever the normal mapping is) but load a read-only alias into the
GDT register.  As long as nothing ever tries to write through the GDTR
alias, no page faults will be generated.  So we just need to make sure
that nothing ever writes to it through GDTR.  AFAIK the only reason
the CPU ever writes to the address in GDTR is to set an accessed bit.

--Andy
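The scheme above can be modeled in user space with two mappings of the same backing pages. A Linux-only sketch (helper name hypothetical): the writable view stands in for the normal percpu mapping, the read-only view for the alias loaded into GDTR.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one 4 KiB file twice: a RW view (the "percpu" GDT) and a RO alias
 * (what would be loaded into GDTR). Writes through the RW view are
 * immediately visible through the RO alias; only a write *through* the
 * RO alias would fault. Returns 0 on success. */
static int ro_alias_demo(void)
{
	FILE *f = tmpfile();
	if (!f || ftruncate(fileno(f), 4096) != 0)
		return -1;

	char *rw = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			fileno(f), 0);
	char *ro = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fileno(f), 0);
	if (rw == MAP_FAILED || ro == MAP_FAILED)
		return -1;

	strcpy(rw, "tls update");                      /* write via RW view */
	return strcmp(ro, "tls update") == 0 ? 0 : -1; /* read via RO alias */
}
```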
Thomas Garnier Jan. 5, 2017, 9:58 p.m. UTC | #16
On Thu, Jan 5, 2017 at 1:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jan 5, 2017 at 1:08 PM, Thomas Garnier <thgarnie@google.com> wrote:
>> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote:
>>>> On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>>>> On 1/5/2017 9:54 AM, Thomas Garnier wrote:
>>>>>
>>>>>>
>>>>>> That's my goal too. I started by doing a RO remap and got couple
>>>>>> problems with hibernation. I can try again for the next iteration or
>>>>>> delay it for another patch. I also need to look at KVM GDT usage, I am
>>>>>> not familiar with it yet.
>>>>>
>>>>>
>>>>> don't we write to the GDT as part of the TLS segment stuff for glibc ?
>>>>>
>>>>
>>>> Not sure which glibc feature it is.
>>>>
>>>> In this design, you can write to the GDT per-cpu variable that will
>>>> remain read-write. You just need to make the remapping writeable when
>>>> we load task registers (ltr) then the processor use the current GDT
>>>> address. At least that the case I know, I might find more through
>>>> testing.
>>>
>>> Hmm.  I bet that if we preset the accessed bits in all the segments
>>> then we don't need it to be writable in general.  But your point about
>>> set_thread_area (TLS) is well taken.  However, I strongly suspect that
>>> we could make set_thread_area unconditionally set the accessed bit and
>>> no one would ever notice.
>>
>> Not sure I fully understood and I don't want to miss an important
>> point. Do you mean making GDT (remapping and per-cpu) read-only and
>> switch the writeable flag only when we write to the per-cpu entry?
>>
>
> What I mean is: write to the GDT through normal percpu access (or
> whatever the normal mapping is) but load a read-only alias into the
> GDT register.  As long as nothing ever tries to write through the GDTR
> alias, no page faults will be generated.  So we just need to make sure
> that nothing ever writes to it through GDTR.  AFAIK the only reason
> the CPU ever writes to the address in GDTR is to set an accessed bit.
>

A write is made when we use load_TR_desc (ltr). I haven't seen any others yet.

> --Andy
Linus Torvalds Jan. 5, 2017, 11:05 p.m. UTC | #17
On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> Hmm.  I bet that if we preset the accessed bits in all the segments
> then we don't need it to be writable in general.

I'm not sure that this is architecturally safe.

IIRC, we do mark the IDT read-only - but that one we started doing due
to the f00f bug, so we knew it was ok. I'm not sure you can do the
same with the GDT/LDT.

                   Linus
Thomas Garnier Jan. 5, 2017, 11:16 p.m. UTC | #18
On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Hmm.  I bet that if we preset the accessed bits in all the segments
>> then we don't need it to be writable in general.
>
> I'm not sure that this is architecturally safe.
>
> IIRC, we do mark the IDT read-only - but that one we started doing due
> to the f00f bug, so we knew it was ok. I'm not sure you can do the
> same with the GDT/LDT.
>

I started testing a variant that makes the GDT remapping read-only by
default and writeable only for LTR. Everything works fine, even
hibernation. I need to do more testing though, on different
architectures.

To be on the safe side, I could separate the read-only part into a
separate patch so we can easily remove it if extended testing shows
something.

>                    Linus
Andy Lutomirski Jan. 6, 2017, 2:34 a.m. UTC | #19
On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Hmm.  I bet that if we preset the accessed bits in all the segments
>> then we don't need it to be writable in general.
>
> I'm not sure that this is architecturally safe.
>

Hmm.  Last time I looked, I couldn't find *anything* in the SDM
explaining what happened if a GDT access resulted in a page fault.  I
did discover that Xen intentionally (!) lazily populates and maps LDT
pages.  An attempt to access a not-present page results in #PF with
the error code indicating kernel access even if the access came from
user mode.

SDM volume 3 7.2.2 says "Pages corresponding to the previous task’s
TSS, the current task’s TSS, and the descriptor table entries for
each all should be marked as read/write."  But I don't see how a CPU
implementation could possibly care what the page table for the TSS
descriptor table entries says after LTR is done because the CPU isn't
even supposed to *read* that memory.

OTOH a valid implementation could easily require that the page table
says that the page is writable merely to load a segment, especially in
weird cases (IRET?).  That being said, this is all quite easy to test.

Also, Thomas, why are you creating a new memory region?  I don't see
any benefit to randomizing the GDT address.  How about just putting it
in the fixmap?  This  would be NR_CPUS * 4 pages if do my limit=0xffff
idea.  I'm not sure if the fixmap code knows how to handle this much
space.
Ingo Molnar Jan. 6, 2017, 6:34 a.m. UTC | #20
* Thomas Garnier <thgarnie@google.com> wrote:

> > Thanks,
> >
> >         Ingo
> 
> Ingo: I saw the 5-level page table support being sent through. Do you
> want me to wait for it to be -next? (Given it will need to be changed
> too).

Please just base your bits on Linus's latest tree - we'll sort out any
conflicts as/when they happen.

Thanks,

	Ingo
Ingo Molnar Jan. 6, 2017, 6:45 a.m. UTC | #21
* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On 1/5/2017 9:54 AM, Thomas Garnier wrote:
> 
> > That's my goal too. I started by doing a RO remap and got couple problems with 
> > hibernation. I can try again for the next iteration or delay it for another 
> > patch. I also need to look at KVM GDT usage, I am not familiar with it yet.
> 
> don't we write to the GDT as part of the TLS segment stuff for glibc ?

Yes - but the idea would be to have two virtual memory aliases:

 1) A world-readable RO mapping of the GDT put into the fixmap area (randomization
    is not really required as the SGDT instruction exposes it anyway).

    This read-only GDT mapping has two advantages:

     - Because the GDT address is read-only it cannot be used by exploits as an 
       easy trampoline for 100% safe, deterministic rooting of the system from a 
       local process.

     - Even on older CPUs the SGDT instruction won't expose the absolute address 
       of critical kernel data structures.

 2) A kernel-only RW mapping in the linear kernel memory map where the original
    page is - randomized (as a natural part of kernel physical and virtual memory
    randomization) and only known and accessible to the kernel, SGDT doesn't 
    expose it even on older CPUs.

This way we IMHO have the best of both worlds.

Assuming there are no strange CPU errata surfacing if we use a RO GDT - OTOH that 
should be fairly easy to test for, and worst-case we could quirk-off the feature 
on CPUs where this cannot be done safely.

Thanks,

	Ingo
Ingo Molnar Jan. 6, 2017, 6:49 a.m. UTC | #22
* Thomas Garnier <thgarnie@google.com> wrote:

> >> Not sure I fully understood and I don't want to miss an important point. Do 
> >> you mean making GDT (remapping and per-cpu) read-only and switch the 
> >> writeable flag only when we write to the per-cpu entry?
> >
> > What I mean is: write to the GDT through normal percpu access (or whatever the 
> > normal mapping is) but load a read-only alias into the GDT register.  As long 
> > as nothing ever tries to write through the GDTR alias, no page faults will be 
> > generated.  So we just need to make sure that nothing ever writes to it 
> > through GDTR.  AFAIK the only reason the CPU ever writes to the address in 
> > GDTR is to set an accessed bit.
> 
> A write is made when we use load_TR_desc (ltr). I didn't see any other yet.

Is this write to the GDT, generated by the LTR instruction, done unconditionally 
by the hardware?

Thanks,

	Ingo
Borislav Petkov Jan. 6, 2017, 5:44 p.m. UTC | #23
On Thu, Jan 05, 2017 at 08:40:29AM -0800, Thomas Garnier wrote:
> > kernel_unrandomize_smp() ...
> >
> 
> That seems like a better name.

Hardly... I'd call it something like kaslr_load_gdt() to actually say
what this function is doing.
Thomas Garnier Jan. 6, 2017, 6:02 p.m. UTC | #24
On Thu, Jan 5, 2017 at 6:34 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> Hmm.  I bet that if we preset the accessed bits in all the segments
>>> then we don't need it to be writable in general.
>>
>> I'm not sure that this is architecturally safe.
>>
>
> Hmm.  Last time I looked, I couldn't find *anything* in the SDM
> explaining what happened if a GDT access resulted in a page fault.  I
> did discover that Xen intentionally (!) lazily populates and maps LDT
> pages.  An attempt to access a not-present page results in #PF with
> the error cod e indicating kernel access even if the access came from
> user mode.
>
> SDM volume 3 7.2.2 says "Pages corresponding to the previous task’s
> TSS, the current task’s TSS, and the descriptor table entries for
> each all should be marked as read/write."  But I don't see how a CPU
> implementation could possibly care what the page table for the TSS
> descriptor table entries says after LTR is done because the CPU isn't
> even supposed to *read* that memory.
>
> OTOH a valid implementation could easily require that the page table
> says that the page is writable merely to load a segment, especially in
> weird cases (IRET?).  That being said, this is all quite easy to test.
>
> Also, Thomas, why are you creating a new memory region?  I don't see
> any benefit to randomizing the GDT address.  How about just putting it
> in the fixmap?  This would be NR_CPUS * 4 pages if we do my limit=0xffff
> idea.  I'm not sure if the fixmap code knows how to handle this much
> space.

When I looked at the fixmap, you had to define the space you need
ahead of time and I am not sure there was enough space as you said.
Thomas Garnier Jan. 6, 2017, 6:03 p.m. UTC | #25
On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Garnier <thgarnie@google.com> wrote:
>
>> >> Not sure I fully understood and I don't want to miss an important point. Do
>> >> you mean making GDT (remapping and per-cpu) read-only and switch the
>> >> writeable flag only when we write to the per-cpu entry?
>> >
>> > What I mean is: write to the GDT through normal percpu access (or whatever the
>> > normal mapping is) but load a read-only alias into the GDT register.  As long
>> > as nothing ever tries to write through the GDTR alias, no page faults will be
>> > generated.  So we just need to make sure that nothing ever writes to it
>> > through GDTR.  AFAIK the only reason the CPU ever writes to the address in
>> > GDTR is to set an accessed bit.
>>
>> A write is made when we use load_TR_desc (ltr). I didn't see any other yet.
>
> Is this write to the GDT, generated by the LTR instruction, done unconditionally
> by the hardware?
>

That was my experience. I didn't look into details. Do you think we
could change something so that ltr never writes to the GDT? (just mark
the TSS entry busy).

> Thanks,
>
>         Ingo
Thomas Garnier Jan. 6, 2017, 6:04 p.m. UTC | #26
On Fri, Jan 6, 2017 at 9:44 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Jan 05, 2017 at 08:40:29AM -0800, Thomas Garnier wrote:
>> > kernel_unrandomize_smp() ...
>> >
>>
>> That seems like a better name.
>
> Hardly... I'd call it something like kaslr_load_gdt() to actually say
> what this function is doing.
>

True, it loads it only for the main processor though. I will try a
variant like kaslr_load_main_gdt().

> --
> Regards/Gruss,
>     Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
Andy Lutomirski Jan. 6, 2017, 9:53 p.m. UTC | #27
On Fri, Jan 6, 2017 at 10:02 AM, Thomas Garnier <thgarnie@google.com> wrote:
> On Thu, Jan 5, 2017 at 6:34 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>> Hmm.  I bet that if we preset the accessed bits in all the segments
>>>> then we don't need it to be writable in general.
>>>
>>> I'm not sure that this is architecturally safe.
>>>
>>
>> Hmm.  Last time I looked, I couldn't find *anything* in the SDM
>> explaining what happened if a GDT access resulted in a page fault.  I
>> did discover that Xen intentionally (!) lazily populates and maps LDT
>> pages.  An attempt to access a not-present page results in #PF with
>> the error code indicating kernel access even if the access came from
>> user mode.
>>
>> SDM volume 3 7.2.2 says "Pages corresponding to the previous task’s
>> TSS, the current task’s TSS, and the descriptor table entries for
>> each all should be marked as read/write."  But I don't see how a CPU
>> implementation could possibly care what the page table for the TSS
>> descriptor table entries says after LTR is done because the CPU isn't
>> even supposed to *read* that memory.
>>
>> OTOH a valid implementation could easily require that the page table
>> says that the page is writable merely to load a segment, especially in
>> weird cases (IRET?).  That being said, this is all quite easy to test.
>>
>> Also, Thomas, why are you creating a new memory region?  I don't see
>> any benefit to randomizing the GDT address.  How about just putting it
>> in the fixmap?  This would be NR_CPUS * 4 pages if we do my limit=0xffff
>> idea.  I'm not sure if the fixmap code knows how to handle this much
>> space.
>
> When I looked at the fixmap, you had to define the space you need
> ahead of time and I am not sure there was enough space as you said.

Can you try it and see if anything goes wrong?  Even if something does
go wrong, I think we should fix *that* rather than making the memory
layout more complicated.
Andy Lutomirski Jan. 6, 2017, 9:59 p.m. UTC | #28
On Fri, Jan 6, 2017 at 10:03 AM, Thomas Garnier <thgarnie@google.com> wrote:
> On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Thomas Garnier <thgarnie@google.com> wrote:
>>
>>> >> Not sure I fully understood and I don't want to miss an important point. Do
>>> >> you mean making GDT (remapping and per-cpu) read-only and switch the
>>> >> writeable flag only when we write to the per-cpu entry?
>>> >
>>> > What I mean is: write to the GDT through normal percpu access (or whatever the
>>> > normal mapping is) but load a read-only alias into the GDT register.  As long
>>> > as nothing ever tries to write through the GDTR alias, no page faults will be
>>> > generated.  So we just need to make sure that nothing ever writes to it
>>> > through GDTR.  AFAIK the only reason the CPU ever writes to the address in
>>> > GDTR is to set an accessed bit.
>>>
>>> A write is made when we use load_TR_desc (ltr). I didn't see any other yet.
>>
>> Is this write to the GDT, generated by the LTR instruction, done unconditionally
>> by the hardware?
>>
>
> That was my experience. I didn't look into details. Do you think we
> could change something so that ltr never writes to the GDT? (just mark
> the TSS entry busy).

No, and I had the way this worked on 64-bit wrong.  LTR requires an
available TSS and changes it to busy.  So here are my thoughts on how
this should work:

Let's get rid of any connection between this code and KASLR.  Every
time KASLR makes something work differently, a kitten turns all
Schrödinger on us.  This is moving the GDT to the fixmap, plain and
simple.  For now, make it one page per CPU and don't worry about the
GDT limit.

On 32-bit, we're going to have to make the fixmap GDT be read-write
because making it read-only will break double-fault handling.

On 64-bit, we can use your trick of temporarily mapping the GDT
read-write every time we load TR, which should happen very rarely.
Alternatively, we can reload the *GDT* every time we reload TR, which
should be comparably slow.  This is going to regress performance in
the extremely rare case where KVM exits to a process that uses
ioperm() (I think), but I doubt anyone cares.  Or maybe we could
arrange to never reload TR when GDT points at the fixmap by having KVM
set the host GDT to the direct version and letting KVM's code to
reload the GDT switch to the fixmap copy.

If we need a quirk to keep the fixmap copy read-write, so be it.

None of this should depend on KASLR.  IMO it should happen unconditionally.

Once all of it works, then we can build on it to allocate four pages
per CPU (with the extra three pointing to the zero page) and speed
up KVM.

--Andy
Thomas Garnier Jan. 6, 2017, 10:54 p.m. UTC | #29
On Fri, Jan 6, 2017 at 1:59 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jan 6, 2017 at 10:03 AM, Thomas Garnier <thgarnie@google.com> wrote:
>> On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> * Thomas Garnier <thgarnie@google.com> wrote:
>>>
>>>> >> Not sure I fully understood and I don't want to miss an important point. Do
>>>> >> you mean making GDT (remapping and per-cpu) read-only and switch the
>>>> >> writeable flag only when we write to the per-cpu entry?
>>>> >
>>>> > What I mean is: write to the GDT through normal percpu access (or whatever the
>>>> > normal mapping is) but load a read-only alias into the GDT register.  As long
>>>> > as nothing ever tries to write through the GDTR alias, no page faults will be
>>>> > generated.  So we just need to make sure that nothing ever writes to it
>>>> > through GDTR.  AFAIK the only reason the CPU ever writes to the address in
>>>> > GDTR is to set an accessed bit.
>>>>
>>>> A write is made when we use load_TR_desc (ltr). I didn't see any other yet.
>>>
>>> Is this write to the GDT, generated by the LTR instruction, done unconditionally
>>> by the hardware?
>>>
>>
>> That was my experience. I didn't look into details. Do you think we
>> could change something so that ltr never writes to the GDT? (just mark
>> the TSS entry busy).
>
> No, and I had the way this worked on 64-bit wrong.  LTR requires an
> available TSS and changes it to busy.  So here are my thoughts on how
> this should work:
>
> Let's get rid of any connection between this code and KASLR.  Every
> time KASLR makes something work differently, a kitten turns all
> Schrödinger on us.  This is moving the GDT to the fixmap, plain and
> simple.  For now, make it one page per CPU and don't worry about the
> GDT limit.

I am all for this change but that's more significant.

Ingo: What do you think about that?

>
> On 32-bit, we're going to have to make the fixmap GDT be read-write
> because making it read-only will break double-fault handling.
>
> On 64-bit, we can use your trick of temporarily mapping the GDT
> read-write every time we load TR, which should happen very rarely.
> Alternatively, we can reload the *GDT* every time we reload TR, which
> should be comparably slow.  This is going to regress performance in
> the extremely rare case where KVM exits to a process that uses
> ioperm() (I think), but I doubt anyone cares.  Or maybe we could
> arrange to never reload TR when GDT points at the fixmap by having KVM
> set the host GDT to the direct version and letting KVM's code to
> reload the GDT switch to the fixmap copy.
>
> If we need a quirk to keep the fixmap copy read-write, so be it.
>
> None of this should depend on KASLR.  IMO it should happen unconditionally.
>

I looked back at the fixmap, and I can see a way it could be done
(using NR_CPUS) like the other fixmap ranges. It would limit the
number of cpus to 512 (there is 2M memory left on fixmap on the
default configuration). That's if we never add any other fixmap on
x64. I don't know if it is an acceptable number and if the fixmap
region could be increased. (128 if we do your kvm trick, of course).

Ingo: What do you think?

> Once all of it works, then we can build on it to allocate four pages
> per CPU (with the extra three pointing to the zero page) and speed
> up KVM.
>
> --Andy
Andy Lutomirski Jan. 6, 2017, 11:39 p.m. UTC | #30
On Fri, Jan 6, 2017 at 2:54 PM, Thomas Garnier <thgarnie@google.com> wrote:
> On Fri, Jan 6, 2017 at 1:59 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Fri, Jan 6, 2017 at 10:03 AM, Thomas Garnier <thgarnie@google.com> wrote:
>>> On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote:
>>>>
>>>> * Thomas Garnier <thgarnie@google.com> wrote:
>>>>
>>>>> >> Not sure I fully understood and I don't want to miss an important point. Do
>>>>> >> you mean making GDT (remapping and per-cpu) read-only and switch the
>>>>> >> writeable flag only when we write to the per-cpu entry?
>>>>> >
>>>>> > What I mean is: write to the GDT through normal percpu access (or whatever the
>>>>> > normal mapping is) but load a read-only alias into the GDT register.  As long
>>>>> > as nothing ever tries to write through the GDTR alias, no page faults will be
>>>>> > generated.  So we just need to make sure that nothing ever writes to it
>>>>> > through GDTR.  AFAIK the only reason the CPU ever writes to the address in
>>>>> > GDTR is to set an accessed bit.
>>>>>
>>>>> A write is made when we use load_TR_desc (ltr). I didn't see any other yet.
>>>>
>>>> Is this write to the GDT, generated by the LTR instruction, done unconditionally
>>>> by the hardware?
>>>>
>>>
>>> That was my experience. I didn't look into details. Do you think we
>>> could change something so that ltr never writes to the GDT? (just mark
>>> the TSS entry busy).
>>
>> No, and I had the way this worked on 64-bit wrong.  LTR requires an
>> available TSS and changes it to busy.  So here are my thoughts on how
>> this should work:
>>
>> Let's get rid of any connection between this code and KASLR.  Every
>> time KASLR makes something work differently, a kitten turns all
>> Schrödinger on us.  This is moving the GDT to the fixmap, plain and
>> simple.  For now, make it one page per CPU and don't worry about the
>> GDT limit.
>
> I am all for this change but that's more significant.
>
> Ingo: What do you think about that?
>
>>
>> On 32-bit, we're going to have to make the fixmap GDT be read-write
>> because making it read-only will break double-fault handling.
>>
>> On 64-bit, we can use your trick of temporarily mapping the GDT
>> read-write every time we load TR, which should happen very rarely.
>> Alternatively, we can reload the *GDT* every time we reload TR, which
>> should be comparably slow.  This is going to regress performance in
>> the extremely rare case where KVM exits to a process that uses
>> ioperm() (I think), but I doubt anyone cares.  Or maybe we could
>> arrange to never reload TR when GDT points at the fixmap by having KVM
>> set the host GDT to the direct version and letting KVM's code to
>> reload the GDT switch to the fixmap copy.
>>
>> If we need a quirk to keep the fixmap copy read-write, so be it.
>>
>> None of this should depend on KASLR.  IMO it should happen unconditionally.
>>
>
> I looked back at the fixmap, and I can see a way it could be done
> (using NR_CPUS) like the other fixmap ranges. It would limit the
> number of cpus to 512 (there is 2M memory left on fixmap on the
> default configuration). That's if we never add any other fixmap on
> x64. I don't know if it is an acceptable number and if the fixmap
> region could be increased. (128 if we do your kvm trick, of course).
>

IIRC we need 4096 CPUs.  But that 2M limit seems eminently fixable.  I
just tried sticking 4096 pages of nothing right near the top of the
fixmap and the only problem I saw was that I had to move MODULES_END
down a little bit.

--Andy

P.S. Let's do the move to the fixmap, read/write as a separate patch.
That will make bisecting much easier.
Ingo Molnar Jan. 7, 2017, 7:35 a.m. UTC | #31
* Thomas Garnier <thgarnie@google.com> wrote:

> > No, and I had the way this worked on 64-bit wrong.  LTR requires an
> > available TSS and changes it to busy.  So here are my thoughts on how
> > this should work:
> >
> > Let's get rid of any connection between this code and KASLR.  Every
> > time KASLR makes something work differently, a kitten turns all
> > Schrödinger on us.  This is moving the GDT to the fixmap, plain and
> > simple.  For now, make it one page per CPU and don't worry about the
> > GDT limit.
> 
> I am all for this change but that's more significant.
> 
> Ingo: What do you think about that?

I agree with Andy: as I alluded to earlier as well this should be an unconditional 
change (tested properly, etc.) that robustifies the GDT mapping for everyone. That 
KASLR kernels improve too is a happy side effect!

> > On 32-bit, we're going to have to make the fixmap GDT be read-write because 
> > making it read-only will break double-fault handling.
> >
> > On 64-bit, we can use your trick of temporarily mapping the GDT read-write 
> > every time we load TR, which should happen very rarely. Alternatively, we can 
> > reload the *GDT* every time we reload TR, which should be comparably slow.  
> > This is going to regress performance in the extremely rare case where KVM 
> > exits to a process that uses ioperm() (I think), but I doubt anyone cares.  Or 
> > maybe we could arrange to never reload TR when GDT points at the fixmap by 
> > having KVM set the host GDT to the direct version and letting KVM's code to 
> > reload the GDT switch to the fixmap copy.

Please check whether the LTR write generates a page fault to a RO PTE even if the 
busy bit is already set. LTR is pretty slow which suggests that it's microcode, 
and microcode is usually not sloppy about such things: i.e. LTR would only 
generate an unconditional write if there's a compatibility dependency on it. But I 
could easily be wrong ...

> > If we need a quirk to keep the fixmap copy read-write, so be it.
> >
> > None of this should depend on KASLR.  IMO it should happen unconditionally.
> 
> I looked back at the fixmap, and I can see a way it could be done
> (using NR_CPUS) like the other fixmap ranges. It would limit the
> number of cpus to 512 (there is 2M memory left on fixmap on the
> default configuration). That's if we never add any other fixmap on
> x64. I don't know if it is an acceptable number and if the fixmap
> region could be increased. (128 if we do your kvm trick, of course).
> 
> Ingo: What do you think?

I think we should scale the fixmap size flexibly with NR_CPUs on 64-bit, and we 
should limit CPUs on 32-bit to a reasonable value.

I.e. let's just do it, if we run into problems it's all solvable AFAICS.

Thanks,

	Ingo
Ingo Molnar Jan. 7, 2017, 7:45 a.m. UTC | #32
* Andy Lutomirski <luto@kernel.org> wrote:

> > I looked back at the fixmap, and I can see a way it could be done (using 
> > NR_CPUS) like the other fixmap ranges. It would limit the number of cpus to 
> > 512 (there is 2M memory left on fixmap on the default configuration). That's 
> > if we never add any other fixmap on x64. I don't know if it is an acceptable 
> > number and if the fixmap region could be increased. (128 if we do your kvm 
> > trick, of course).
> 
> IIRC we need 4096 CPUs.

On 64-bit the limit is 8192 CPUs, and the SGI guys are relying on that up to the 
tune of 6144 cores already, and I'd say the 64-bit CPU count is likely to go up 
further with 5-level paging.

On 32-bit the reasonable CPU limit is the number that the Intel 32-bit cluster 
computing nodes use. The latest public numbers are I think 36 'tiles' with each 
tile being a 2-CPU SMT core - i.e. a limit of 72 CPUs has to be maintained. 
(They'll obviously go to 64-bit as well so this problem will go away in a hardware 
generation or two.)

So I'd say 128 CPUs on 32-bit should be a reasonable practical limit going 
forward. Right now our 32-bit limit is 512 CPUs IIRC, but I don't think any real 
hardware in production is reaching that.

> P.S. Let's do the move to the fixmap, read/write as a separate patch. That will 
> make bisecting much easier.

Absolutely, but this has to be within the same series, as the interim fixmap-only 
step is less secure in some circumstances: we are moving the writable GDT from a 
previously randomized location to a fixed location.

Thanks,

	Ingo
Ingo Molnar Jan. 7, 2017, 7:46 a.m. UTC | #33
* Andy Lutomirski <luto@kernel.org> wrote:

> > When I looked at the fixmap, you had to define the space you need ahead of 
> > time and I am not sure there was enough space as you said.
> 
> Can you try it and see if anything goes wrong?  Even if something does go wrong, 
> I think we should fix *that* rather than making the memory layout more 
> complicated.

Absolutely! This should always be the driving principle when complicating the 
kernel's memory layout.

Thanks,

	Ingo
Andy Lutomirski Jan. 7, 2017, 3:58 p.m. UTC | #34
On Fri, Jan 6, 2017 at 11:45 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:

>> P.S. Let's do the move to the fixmap, read/write as a separate patch. That will
>> make bisecting much easier.
>
> Absolutely, but this has to be within the same series, as the interim fixmap-only
> step is less secure in some circumstances: we are moving the writable GDT from a
> previously randomized location to a fixed location.

True, but despite being randomized its location was never even
remotely secret.  (Except on Kaby Lake or Foobar Lake or whatever CPU
that is.)

--Andy
Andy Lutomirski Jan. 7, 2017, 4:02 p.m. UTC | #35
On Fri, Jan 6, 2017 at 11:35 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Garnier <thgarnie@google.com> wrote:
>
>> > No, and I had the way this worked on 64-bit wrong.  LTR requires an
>> > available TSS and changes it to busy.  So here are my thoughts on how
>> > this should work:
>> >
>> > Let's get rid of any connection between this code and KASLR.  Every
>> > time KASLR makes something work differently, a kitten turns all
>> > Schrödinger on us.  This is moving the GDT to the fixmap, plain and
>> > simple.  For now, make it one page per CPU and don't worry about the
>> > GDT limit.
>>
>> I am all for this change but that's more significant.
>>
>> Ingo: What do you think about that?
>
> I agree with Andy: as I alluded to earlier as well this should be an unconditional
> change (tested properly, etc.) that robustifies the GDT mapping for everyone. That
> KASLR kernels improve too is a happy side effect!
>
>> > On 32-bit, we're going to have to make the fixmap GDT be read-write because
>> > making it read-only will break double-fault handling.
>> >
>> > On 64-bit, we can use your trick of temporarily mapping the GDT read-write
>> > every time we load TR, which should happen very rarely. Alternatively, we can
>> > reload the *GDT* every time we reload TR, which should be comparably slow.
>> > This is going to regress performance in the extremely rare case where KVM
>> > exits to a process that uses ioperm() (I think), but I doubt anyone cares.  Or
>> > maybe we could arrange to never reload TR when GDT points at the fixmap by
>> > having KVM set the host GDT to the direct version and letting KVM's code to
>> > reload the GDT switch to the fixmap copy.
>
> Please check whether the LTR write generates a page fault to a RO PTE even if the
> busy bit is already set. LTR is pretty slow which suggests that it's microcode,
> and microcode is usually not sloppy about such things: i.e. LTR would only
> generate an unconditional write if there's a compatibility dependency on it. But I
> could easily be wrong ...

The SDM says:

IF segment descriptor is not for an available TSS
THEN #GP(segment selector); FI;

so I think it's #GP not #PF.

>
>> > If we need a quirk to keep the fixmap copy read-write, so be it.
>> >
>> > None of this should depend on KASLR.  IMO it should happen unconditionally.
>>
>> I looked back at the fixmap, and I can see a way it could be done
>> (using NR_CPUS) like the other fixmap ranges. It would limit the
>> number of cpus to 512 (there is 2M memory left on fixmap on the
>> default configuration). That's if we never add any other fixmap on
>> x64. I don't know if it is an acceptable number and if the fixmap
>> region could be increased. (128 if we do your kvm trick, of course).
>>
>> Ingo: What do you think?
>
> I think we should scale the fixmap size flexibly with NR_CPUs on 64-bit, and we
> should limit CPUs on 32-bit to a reasonable value.

Unless the headers are even more tangled than usual, I think we could
just handle this exactly.  The top and bottom of the fixmap are known
exactly at compile time, so we should just be able to make the next mm
range adjust its start point accordingly.  The main issue right now
seems to be that MODULES_END is hard-coded.

For 32-bit, the task gate in #DF is going to be a show-stopper for an
RO fixmap, I think.  I'm still slightly in favor of enabling the code
but with an RW mapping on 32-bit, though.  And anyone running hundreds
to thousands of CPUs on 32-bit is nuts anyway.

--Andy
Thomas Garnier Jan. 9, 2017, 10:32 p.m. UTC | #36
On Fri, Jan 6, 2017 at 11:35 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Garnier <thgarnie@google.com> wrote:
>
>> > No, and I had the way this worked on 64-bit wrong.  LTR requires an
>> > available TSS and changes it to busy.  So here are my thoughts on how
>> > this should work:
>> >
>> > Let's get rid of any connection between this code and KASLR.  Every
>> > time KASLR makes something work differently, a kitten turns all
>> > Schrödinger on us.  This is moving the GDT to the fixmap, plain and
>> > simple.  For now, make it one page per CPU and don't worry about the
>> > GDT limit.
>>
>> I am all for this change but that's more significant.
>>
>> Ingo: What do you think about that?
>
> I agree with Andy: as I alluded to earlier as well this should be an unconditional
> change (tested properly, etc.) that robustifies the GDT mapping for everyone. That
> KASLR kernels improve too is a happy side effect!
>
>> > On 32-bit, we're going to have to make the fixmap GDT be read-write because
>> > making it read-only will break double-fault handling.
>> >
>> > On 64-bit, we can use your trick of temporarily mapping the GDT read-write
>> > every time we load TR, which should happen very rarely. Alternatively, we can
>> > reload the *GDT* every time we reload TR, which should be comparably slow.
>> > This is going to regress performance in the extremely rare case where KVM
>> > exits to a process that uses ioperm() (I think), but I doubt anyone cares.  Or
>> > maybe we could arrange to never reload TR when GDT points at the fixmap by
>> > having KVM set the host GDT to the direct version and letting KVM's code to
>> > reload the GDT switch to the fixmap copy.
>
> Please check whether the LTR write generates a page fault to a RO PTE even if the
> busy bit is already set. LTR is pretty slow which suggests that it's microcode,
> and microcode is usually not sloppy about such things: i.e. LTR would only
> generate an unconditional write if there's a compatibility dependency on it. But I
> could easily be wrong ...
>

Coming back on that after a bit more testing. The LTR instruction
checks whether the busy bit is already set; if it is, it just issues
a #GP, as it would for a bad selector:

[    0.000000] general protection fault: 0040 [#1] SMP
...
[    0.000000] RIP: 0010:native_load_tr_desc+0x9/0x10
...
[    0.000000] Call Trace:
[    0.000000]  cpu_init+0x2d0/0x3c0
[    0.000000]  trap_init+0x2a2/0x312
[    0.000000]  start_kernel+0x1fb/0x43b
[    0.000000]  ? set_init_arg+0x55/0x55
[    0.000000]  ? early_idt_handler_array+0x120/0x120
[    0.000000]  x86_64_start_reservations+0x2a/0x2c
[    0.000000]  x86_64_start_kernel+0x13d/0x14c
[    0.000000]  start_cpu+0x14/0x14

I assume that's in this part of the pseudo-code:

if (!IsWithinDescriptorTableLimit(Source.Offset) || Source.Type != TypeGlobal)
    Exception(GP(SegmentSelector));
SegmentDescriptor = ReadSegmentDescriptor();
if (!IsForAnAvailableTSS(SegmentDescriptor))
    Exception(GP(SegmentSelector));   <---- That's where I got the #GP
TSSSegmentDescriptor.Busy = 1;        <---- That's the page fault I get otherwise
// Locked read-modify-write operation on the entire descriptor when setting the busy flag
TaskRegister.SegmentSelector = Source;
TaskRegister.SegmentDescriptor = TSSSegmentDescriptor;

I assume the best option would be to make the remap read-write for the
LTR instruction. What do you think?

>> > If we need a quirk to keep the fixmap copy read-write, so be it.
>> >
>> > None of this should depend on KASLR.  IMO it should happen unconditionally.
>>
>> I looked back at the fixmap, and I can see a way it could be done
>> (using NR_CPUS) like the other fixmap ranges. It would limit the
>> number of cpus to 512 (there is 2M memory left on fixmap on the
>> default configuration). That's if we never add any other fixmap on
>> x64. I don't know if it is an acceptable number and if the fixmap
>> region could be increased. (128 if we do your kvm trick, of course).
>>
>> Ingo: What do you think?
>
> I think we should scale the fixmap size flexibly with NR_CPUs on 64-bit, and we
> should limit CPUs on 32-bit to a reasonable value.
>
> I.e. let's just do it, if we run into problems it's all solvable AFAICS.
>
> Thanks,
>
>         Ingo
Ingo Molnar Jan. 10, 2017, 10:27 a.m. UTC | #37
* Thomas Garnier <thgarnie@google.com> wrote:

> Coming back on that after a bit more testing. The LTR instruction
> check if the busy bit is already set, if already set then it will just
> issue a #GP given a bad selector:
> 
> [    0.000000] general protection fault: 0040 [#1] SMP
> ...
> [    0.000000] RIP: 0010:native_load_tr_desc+0x9/0x10
> ...
> [    0.000000] Call Trace:
> [    0.000000]  cpu_init+0x2d0/0x3c0
> [    0.000000]  trap_init+0x2a2/0x312
> [    0.000000]  start_kernel+0x1fb/0x43b
> [    0.000000]  ? set_init_arg+0x55/0x55
> [    0.000000]  ? early_idt_handler_array+0x120/0x120
> [    0.000000]  x86_64_start_reservations+0x2a/0x2c
> [    0.000000]  x86_64_start_kernel+0x13d/0x14c
> [    0.000000]  start_cpu+0x14/0x14
> 
> I assume that's in this part of the pseudo-code:
> 
> if(!IsWithinDescriptorTableLimit(Source.Offset) || Source.Type !=
> TypeGlobal) Exception(GP(SegmentSelector));
> SegmentDescriptor = ReadSegmentDescriptor();
> if(!IsForAnAvailableTSS(SegmentDescriptor))
> Exception(GP(SegmentSelector)); <---- That's where I got the GP
> TSSSegmentDescriptor.Busy = 1;
> <------------------------------------------------------------------
> That's the pagefault I get otherwise
> //Locked read-modify-write operation on the entire descriptor when
> setting busy flag
> TaskRegister.SegmentSelector = Source;
> TaskRegister.SegmentDescriptor.TSSSegmentDescriptor;
> 
> I assume the best option would be to make the remap read-write for the
> LTR instruction. What do you think?

So if LTR does not modify the GDT when the busy bit is already set, why don't we set
the busy bit in the descriptor (via the linear mapping rw alias)?

Then the remapped GDT can stay read-only all the time and LTR won't fault.

Am I missing something here?

Thanks,

	Ingo
Thomas Garnier Jan. 10, 2017, 5:13 p.m. UTC | #38
On Tue, Jan 10, 2017 at 2:27 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Garnier <thgarnie@google.com> wrote:
>
>> Coming back on that after a bit more testing. The LTR instruction
>> check if the busy bit is already set, if already set then it will just
>> issue a #GP given a bad selector:
>>
>> [    0.000000] general protection fault: 0040 [#1] SMP
>> ...
>> [    0.000000] RIP: 0010:native_load_tr_desc+0x9/0x10
>> ...
>> [    0.000000] Call Trace:
>> [    0.000000]  cpu_init+0x2d0/0x3c0
>> [    0.000000]  trap_init+0x2a2/0x312
>> [    0.000000]  start_kernel+0x1fb/0x43b
>> [    0.000000]  ? set_init_arg+0x55/0x55
>> [    0.000000]  ? early_idt_handler_array+0x120/0x120
>> [    0.000000]  x86_64_start_reservations+0x2a/0x2c
>> [    0.000000]  x86_64_start_kernel+0x13d/0x14c
>> [    0.000000]  start_cpu+0x14/0x14
>>
>> I assume that's in this part of the pseudo-code:
>>
>> if(!IsWithinDescriptorTableLimit(Source.Offset) || Source.Type !=
>> TypeGlobal) Exception(GP(SegmentSelector));
>> SegmentDescriptor = ReadSegmentDescriptor();
>> if(!IsForAnAvailableTSS(SegmentDescriptor))
>> Exception(GP(SegmentSelector)); <---- That's where I got the GP
>> TSSSegmentDescriptor.Busy = 1;
>> <------------------------------------------------------------------
>> That's the pagefault I get otherwise
>> //Locked read-modify-write operation on the entire descriptor when
>> setting busy flag
>> TaskRegister.SegmentSelector = Source;
>> TaskRegister.SegmentDescriptor.TSSSegmentDescriptor;
>>
>> I assume the best option would be to make the remap read-write for the
>> LTR instruction. What do you think?
>
> So if LTR does not modify the GDT if the busy bit is already set, why don't we set
> the busy bit in the descriptor (via the linear mapping rw alias).
>
> Then the remapped GDT can stay read-only all the time and LTR won't fault.
>
> Am I missing something here?
>

Sorry, I may not have explained myself well.

If you set the busy bit on the GDT TSS entry, you get a #GP on LTR.

When we use the LTR instruction, we need the GDT to be writeable.

I think I can handle it by switching to the remap GDT after
load_TR_desc. I will try this approach.

> Thanks,
>
>         Ingo

Patch

diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h
index 1052a797d71d..babc32803182 100644
--- a/arch/x86/include/asm/kaslr.h
+++ b/arch/x86/include/asm/kaslr.h
@@ -9,8 +9,12 @@  extern unsigned long vmalloc_base;
 extern unsigned long vmemmap_base;
 
 void kernel_randomize_memory(void);
+void kernel_randomize_smp(void);
+void* kaslr_get_gdt_remap(int cpu);
 #else
 static inline void kernel_randomize_memory(void) { }
+static inline void kernel_randomize_smp(void) { }
+static inline void *kaslr_get_gdt_remap(int cpu) { return NULL; }
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index dc1697ca5191..2c8a7b4718ea 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -450,8 +450,13 @@  void load_percpu_segment(int cpu)
 void switch_to_new_gdt(int cpu)
 {
 	struct desc_ptr gdt_descr;
+	struct desc_struct *gdt;
 
-	gdt_descr.address = (long)get_cpu_gdt_table(cpu);
+	gdt = kaslr_get_gdt_remap(cpu);
+	if (!gdt)
+		gdt = get_cpu_gdt_table(cpu);
+
+	gdt_descr.address = (long)gdt;
 	gdt_descr.size = GDT_SIZE - 1;
 	load_gdt(&gdt_descr);
 	/* Reload the per-cpu base */
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ea9c49adaa1f..213fe01f28dc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -50,6 +50,9 @@  enum address_markers_idx {
 	LOW_KERNEL_NR,
 	VMALLOC_START_NR,
 	VMEMMAP_START_NR,
+# ifdef CONFIG_RANDOMIZE_MEMORY
+	GDT_REMAP_NR,
+# endif
 # ifdef CONFIG_X86_ESPFIX64
 	ESPFIX_START_NR,
 # endif
@@ -75,6 +78,9 @@  static struct addr_marker address_markers[] = {
 	{ 0/* PAGE_OFFSET */,   "Low Kernel Mapping" },
 	{ 0/* VMALLOC_START */, "vmalloc() Area" },
 	{ 0/* VMEMMAP_START */, "Vmemmap" },
+# ifdef CONFIG_RANDOMIZE_MEMORY
+	{ 0, "GDT remapping" },
+# endif
 # ifdef CONFIG_X86_ESPFIX64
 	{ ESPFIX_BASE_ADDR,	"ESPfix Area", 16 },
 # endif
@@ -442,6 +448,10 @@  static int __init pt_dump_init(void)
 	address_markers[LOW_KERNEL_NR].start_address = PAGE_OFFSET;
 	address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
 	address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+#ifdef CONFIG_RANDOMIZE_MEMORY
+	address_markers[GDT_REMAP_NR].start_address =
+		(unsigned long) kaslr_get_gdt_remap(0);
+#endif
 #endif
 #ifdef CONFIG_X86_32
 	address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 887e57182716..db1bdb75f8af 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -22,11 +22,13 @@ 
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/random.h>
+#include <linux/slab.h>
 
 #include <asm/pgalloc.h>
 #include <asm/pgtable.h>
 #include <asm/setup.h>
 #include <asm/kaslr.h>
+#include <asm/desc.h>
 
 #include "mm_internal.h"
 
@@ -60,6 +62,7 @@  unsigned long vmalloc_base = __VMALLOC_BASE;
 EXPORT_SYMBOL(vmalloc_base);
 unsigned long vmemmap_base = __VMEMMAP_BASE;
 EXPORT_SYMBOL(vmemmap_base);
+unsigned long gdt_tables_base = 0;
 
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
@@ -97,7 +100,7 @@  void __init kernel_randomize_memory(void)
 	unsigned long vaddr = vaddr_start;
 	unsigned long rand, memory_tb;
 	struct rnd_state rand_state;
-	unsigned long remain_entropy;
+	unsigned long remain_entropy, gdt_reserved;
 
 	/*
 	 * All these BUILD_BUG_ON checks ensures the memory layout is
@@ -131,6 +134,13 @@  void __init kernel_randomize_memory(void)
 	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
 		remain_entropy -= get_padding(&kaslr_regions[i]);
 
+	/* Reserve space for fixed GDTs, if we have enough available */
+	gdt_reserved = sizeof(struct gdt_page) * max(setup_max_cpus, 1U);
+	if (gdt_reserved < remain_entropy) {
+		gdt_tables_base = vaddr_end - gdt_reserved;
+		remain_entropy -= gdt_reserved;
+	}
+
 	prandom_seed_state(&rand_state, kaslr_get_random_long("Memory"));
 
 	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
@@ -192,3 +202,98 @@  void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+/* Hold the remapping of the gdt page for each cpu */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct desc_struct *, gdt_remap);
+
+/* Return the address where the GDT is remapped for this CPU */
+static unsigned long gdt_remap_address(int cpu)
+{
+	return gdt_tables_base + cpu * sizeof(struct gdt_page);
+}
+
+/* Remap the specified gdt table */
+static struct desc_struct *remap_gdt(int cpu)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	struct desc_struct *gdt;
+	unsigned long addr;
+
+	/* GDT table should be only one page */
+	BUILD_BUG_ON(sizeof(struct gdt_page) != PAGE_SIZE);
+
+	/* Keep the original GDT before the allocator is available */
+	if (!slab_is_available())
+		return NULL;
+
+	gdt = get_cpu_gdt_table(cpu);
+	addr = gdt_remap_address(cpu);
+
+	pgd = pgd_offset_k(addr);
+	pud = pud_alloc(&init_mm, pgd, addr);
+	if (WARN_ON(!pud))
+		return NULL;
+	pmd = pmd_alloc(&init_mm, pud, addr);
+	if (WARN_ON(!pmd))
+		return NULL;
+	pte = pte_alloc_kernel(pmd, addr);
+	if (WARN_ON(!pte))
+		return NULL;
+
+	/* If the PTE is already set, something is wrong with the VA ranges */
+	BUG_ON(!pte_none(*pte));
+
+	/* Remap the target GDT and return it */
+	set_pte_at(&init_mm, addr, pte,
+		   pfn_pte(PFN_DOWN(__pa(gdt)), PAGE_KERNEL));
+	gdt = (struct desc_struct *)addr;
+	per_cpu(gdt_remap, cpu) = gdt;
+	return gdt;
+}
+
+/* Check if GDT remapping is enabled */
+static bool kaslr_gdt_remap_enabled(void)
+{
+	return kaslr_memory_enabled() && gdt_tables_base != 0;
+}
+
+/*
+ * The GDT table address is available to user-mode through the sgdt
+ * instruction. This function will return a fixed remapping to load so you
+ * cannot leak the per-cpu structure address.
+ */
+void* kaslr_get_gdt_remap(int cpu)
+{
+	struct desc_struct *gdt_remapping;
+
+	if (!kaslr_gdt_remap_enabled())
+		return NULL;
+
+	gdt_remapping = per_cpu(gdt_remap, cpu);
+	if (!gdt_remapping)
+		gdt_remapping = remap_gdt(cpu);
+
+	return gdt_remapping;
+}
+
+/*
+ * Switch the first processor GDT to the remapping. The GDT is loaded too early
+ * to generate the remapping correctly. This step is done later at boot or
+ * before other processors come back from hibernation.
+ */
+void kernel_randomize_smp(void)
+{
+	struct desc_ptr gdt_descr;
+	struct desc_struct *gdt;
+
+	gdt = kaslr_get_gdt_remap(raw_smp_processor_id());
+	if (WARN_ON(!gdt))
+		return;
+
+	gdt_descr.address = (long)gdt;
+	gdt_descr.size = GDT_SIZE - 1;
+	load_gdt(&gdt_descr);
+}
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f75c4d031eeb..4d6979299b9a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1040,6 +1040,9 @@  void enable_nonboot_cpus(void)
 {
 	int cpu, error;
 
+	/* Redo KASLR steps for main processor */
+	kernel_randomize_smp();
+
 	/* Allow everyone to use the CPU hotplug again */
 	cpu_maps_update_begin();
 	__cpu_hotplug_enable();
diff --git a/kernel/smp.c b/kernel/smp.c
index 77fcdb9f2775..e1ef8d05e179 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -554,6 +554,7 @@  void __init smp_init(void)
 
 	idle_threads_init();
 	cpuhp_threads_init();
+	kernel_randomize_smp();
 
 	pr_info("Bringing up secondary CPUs ...\n");