Message ID: 20170104221630.831-1-thgarnie@google.com
State: New, archived
* Thomas Garnier <thgarnie@google.com> wrote: > Each processor holds a GDT in its per-cpu structure. The sgdt > instruction gives the base address of the current GDT. This address can > be used to bypass KASLR memory randomization. With another bug, an > attacker could target other per-cpu structures or deduce the base of the > main memory section (PAGE_OFFSET). > > In this change, a space is reserved at the end of the memory range > available for KASLR memory randomization. The space is big enough to hold > the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is > mapped at specific offset based on the target CPU. Note that if there is > not enough space available, the GDTs are not remapped. > > The document was changed to mention GDT remapping for KASLR. This patch > also include dump page tables support. > > This patch was tested on multiple hardware configurations and for > hibernation support. > void kernel_randomize_memory(void); > +void kernel_randomize_smp(void); > +void* kaslr_get_gdt_remap(int cpu); Yeah, no fundamental objections from me to the principle, but I get some bad vibes from the naming here: seeing that kernel_randomize_smp() actually makes things less random. Also, don't we want to do this unconditionally and not allow remapping failures? The GDT is fairly small, plus making the SGDT instruction expose fewer kernel internals would be (marginally) useful on non-randomized kernels as well. It also makes the code more common, more predictable, more debuggable and less complex overall - which is pretty valuable in terms of long term security as well. Thanks, Ingo
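For illustration only: the patch declares a kaslr_get_gdt_remap() helper (see the diff at the end of this page), but its body is cut off in the truncated diff. Given the description above (one page per possible CPU, reserved at the end of the KASLR region, with remapping skipped if the reservation fails), a plausible shape for it, living next to gdt_tables_base in arch/x86/mm/kaslr.c, would be:

/*
 * Sketch, not the patch's actual code: gdt_tables_base is the base of
 * the reserved area (zero if the reservation failed), as declared in
 * the patch; setup_max_cpus bounds the number of remapped GDTs.
 */
void *kaslr_get_gdt_remap(int cpu)
{
	if (!gdt_tables_base || (unsigned int)cpu >= setup_max_cpus)
		return NULL;	/* not reserved: fall back to the per-cpu GDT */

	return (void *)(gdt_tables_base + (unsigned long)cpu * PAGE_SIZE);
}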
On 1/5/2017 12:11 AM, Ingo Molnar wrote: > > * Thomas Garnier <thgarnie@google.com> wrote: > >> Each processor holds a GDT in its per-cpu structure. The sgdt >> instruction gives the base address of the current GDT. This address can >> be used to bypass KASLR memory randomization. With another bug, an >> attacker could target other per-cpu structures or deduce the base of the >> main memory section (PAGE_OFFSET). >> >> In this change, a space is reserved at the end of the memory range >> available for KASLR memory randomization. The space is big enough to hold >> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is >> mapped at specific offset based on the target CPU. Note that if there is >> not enough space available, the GDTs are not remapped. >> >> The document was changed to mention GDT remapping for KASLR. This patch >> also include dump page tables support. >> >> This patch was tested on multiple hardware configurations and for >> hibernation support. > >> void kernel_randomize_memory(void); >> +void kernel_randomize_smp(void); >> +void* kaslr_get_gdt_remap(int cpu); > > Yeah, no fundamental objections from me to the principle, but I get some bad vibes > from the naming here: seeing that kernel_randomize_smp() actually makes things > less random. > kernel_unrandomize_smp() ... one request.. can we make sure this unrandomization is optional?
On Thu, Jan 5, 2017 at 12:11 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Thomas Garnier <thgarnie@google.com> wrote: > >> Each processor holds a GDT in its per-cpu structure. The sgdt >> instruction gives the base address of the current GDT. This address can >> be used to bypass KASLR memory randomization. With another bug, an >> attacker could target other per-cpu structures or deduce the base of the >> main memory section (PAGE_OFFSET). >> >> In this change, a space is reserved at the end of the memory range >> available for KASLR memory randomization. The space is big enough to hold >> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is >> mapped at specific offset based on the target CPU. Note that if there is >> not enough space available, the GDTs are not remapped. >> >> The document was changed to mention GDT remapping for KASLR. This patch >> also include dump page tables support. >> >> This patch was tested on multiple hardware configurations and for >> hibernation support. > >> void kernel_randomize_memory(void); >> +void kernel_randomize_smp(void); >> +void* kaslr_get_gdt_remap(int cpu); > > Yeah, no fundamental objections from me to the principle, but I get some bad vibes > from the naming here: seeing that kernel_randomize_smp() actually makes things > less random. > I agree, I went back and forth on the name. I will change it to something better. > Also, don't we want to do this unconditionally and not allow remapping failures? > > The GDT is fairly small, plus making the SGDT instruction expose fewer kernel > internals would be (marginally) useful on non-randomized kernels as well. > > It also makes the code more common, more predictable, more debuggable and less > complex overall - which is pretty valuable in terms of long term security as well. > Okay, I will add BUG_ON on failures to remap. > Thanks, > > Ingo Ingo: I saw the 5-level page table support being sent through. Do you want me to wait for it to be -next? (Given it will need to be changed too).
On Thu, Jan 5, 2017 at 7:08 AM, Arjan van de Ven <arjan@linux.intel.com> wrote: > On 1/5/2017 12:11 AM, Ingo Molnar wrote: >> >> >> * Thomas Garnier <thgarnie@google.com> wrote: >> >>> Each processor holds a GDT in its per-cpu structure. The sgdt >>> instruction gives the base address of the current GDT. This address can >>> be used to bypass KASLR memory randomization. With another bug, an >>> attacker could target other per-cpu structures or deduce the base of the >>> main memory section (PAGE_OFFSET). >>> >>> In this change, a space is reserved at the end of the memory range >>> available for KASLR memory randomization. The space is big enough to hold >>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is >>> mapped at specific offset based on the target CPU. Note that if there is >>> not enough space available, the GDTs are not remapped. >>> >>> The document was changed to mention GDT remapping for KASLR. This patch >>> also include dump page tables support. >>> >>> This patch was tested on multiple hardware configurations and for >>> hibernation support. >> >> >>> void kernel_randomize_memory(void); >>> +void kernel_randomize_smp(void); >>> +void* kaslr_get_gdt_remap(int cpu); >> >> >> Yeah, no fundamental objections from me to the principle, but I get some >> bad vibes >> from the naming here: seeing that kernel_randomize_smp() actually makes >> things >> less random. >> > > kernel_unrandomize_smp() ... > That seems like a better name. > one request.. can we make sure this unrandomization is optional? > Well, it happens only when KASLR memory randomization is enabled. Do you think it should have a separate config option?
On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote: > Each processor holds a GDT in its per-cpu structure. The sgdt > instruction gives the base address of the current GDT. This address can > be used to bypass KASLR memory randomization. With another bug, an > attacker could target other per-cpu structures or deduce the base of the > main memory section (PAGE_OFFSET). > > In this change, a space is reserved at the end of the memory range > available for KASLR memory randomization. The space is big enough to hold > the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is > mapped at specific offset based on the target CPU. Note that if there is > not enough space available, the GDTs are not remapped. Can we remap it read-only? I.e. use PAGE_KERNEL_RO instead of PAGE_KERNEL. After all, the ability to modify the GDT is instant root.
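A minimal sketch of such a remapping, assuming it is built from the per-cpu GDT page (illustrative only, not the patch's code):

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/vmalloc.h>
#include <asm/desc.h>

/*
 * Sketch: build a read-only alias of one CPU's GDT page.
 * per_cpu_ptr_to_phys() is used because per-cpu memory is not
 * guaranteed to live in the linear mapping.
 */
static struct desc_struct *gdt_remap_ro(int cpu)
{
	phys_addr_t phys = per_cpu_ptr_to_phys(get_cpu_gdt_table(cpu));
	struct page *page = pfn_to_page(phys >> PAGE_SHIFT);

	/* GDT_SIZE is well below PAGE_SIZE, so a single page is enough. */
	return vmap(&page, 1, VM_MAP, PAGE_KERNEL_RO);
}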
On Thu, Jan 5, 2017 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote: > On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote: >> Each processor holds a GDT in its per-cpu structure. The sgdt >> instruction gives the base address of the current GDT. This address can >> be used to bypass KASLR memory randomization. With another bug, an >> attacker could target other per-cpu structures or deduce the base of the >> main memory section (PAGE_OFFSET). >> >> In this change, a space is reserved at the end of the memory range >> available for KASLR memory randomization. The space is big enough to hold >> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is >> mapped at specific offset based on the target CPU. Note that if there is >> not enough space available, the GDTs are not remapped. > > Can we remap it read-only? I.e. use PAGE_KERNEL_RO instead of > PAGE_KERNEL. After all, the ability to modify the GDT is instant > root. That's my goal too. I started by doing a RO remap and got couple problems with hibernation. I can try again for the next iteration or delay it for another patch. I also need to look at KVM GDT usage, I am not familiar with it yet.
On Thu, Jan 5, 2017 at 9:54 AM, Thomas Garnier <thgarnie@google.com> wrote: > On Thu, Jan 5, 2017 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote: >> On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote: >>> Each processor holds a GDT in its per-cpu structure. The sgdt >>> instruction gives the base address of the current GDT. This address can >>> be used to bypass KASLR memory randomization. With another bug, an >>> attacker could target other per-cpu structures or deduce the base of the >>> main memory section (PAGE_OFFSET). >>> >>> In this change, a space is reserved at the end of the memory range >>> available for KASLR memory randomization. The space is big enough to hold >>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is >>> mapped at specific offset based on the target CPU. Note that if there is >>> not enough space available, the GDTs are not remapped. >> >> Can we remap it read-only? I.e. use PAGE_KERNEL_RO instead of >> PAGE_KERNEL. After all, the ability to modify the GDT is instant >> root. > > That's my goal too. I started by doing a RO remap and got couple > problems with hibernation. I can try again for the next iteration or > delay it for another patch. I also need to look at KVM GDT usage, I am > not familiar with it yet. If you want a small adventure, I think a significant KVM-related performance improvement is available. Specifically, on VMX exits, the GDT limit is hardwired to 0xffff (IIRC -- I could be remembering the actual value wrong). KVM does LGDT to fix it. If we actually made the GDT have limit 0xffff (presumably by mapping the zero page a few times to pad it out without wasting memory), then we would avoid the LGDT. LGDT is incredibly slow, so this would be a big win. Want to see if you can make this work with your patch set?
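Extending the earlier read-only remapping sketch, the padding idea could look roughly like this (illustrative only; empty_zero_page comes from <asm/pgtable.h>, and the page count, 4 per CPU in Andy's suggestion above, is left as a parameter):

/*
 * Sketch: alias the real GDT page once and back the rest of the range
 * with the shared zero page, so the GDT limit can be raised to match
 * the limit that VMX exits leave behind without spending real memory.
 */
static struct desc_struct *gdt_remap_padded(int cpu, unsigned int npages)
{
	phys_addr_t phys = per_cpu_ptr_to_phys(get_cpu_gdt_table(cpu));
	struct page *pages[16];
	unsigned int i;

	if (npages < 1 || npages > ARRAY_SIZE(pages))
		return NULL;

	pages[0] = pfn_to_page(phys >> PAGE_SHIFT);
	for (i = 1; i < npages; i++)
		pages[i] = virt_to_page(empty_zero_page);

	return vmap(pages, npages, VM_MAP, PAGE_KERNEL_RO);
}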
On Thu, Jan 5, 2017 at 10:01 AM, Andy Lutomirski <luto@amacapital.net> wrote: > On Thu, Jan 5, 2017 at 9:54 AM, Thomas Garnier <thgarnie@google.com> wrote: >> On Thu, Jan 5, 2017 at 9:51 AM, Andy Lutomirski <luto@amacapital.net> wrote: >>> On Wed, Jan 4, 2017 at 2:16 PM, Thomas Garnier <thgarnie@google.com> wrote: >>>> Each processor holds a GDT in its per-cpu structure. The sgdt >>>> instruction gives the base address of the current GDT. This address can >>>> be used to bypass KASLR memory randomization. With another bug, an >>>> attacker could target other per-cpu structures or deduce the base of the >>>> main memory section (PAGE_OFFSET). >>>> >>>> In this change, a space is reserved at the end of the memory range >>>> available for KASLR memory randomization. The space is big enough to hold >>>> the maximum number of CPUs (as defined by setup_max_cpus). Each GDT is >>>> mapped at specific offset based on the target CPU. Note that if there is >>>> not enough space available, the GDTs are not remapped. >>> >>> Can we remap it read-only? I.e. use PAGE_KERNEL_RO instead of >>> PAGE_KERNEL. After all, the ability to modify the GDT is instant >>> root. >> >> That's my goal too. I started by doing a RO remap and got couple >> problems with hibernation. I can try again for the next iteration or >> delay it for another patch. I also need to look at KVM GDT usage, I am >> not familiar with it yet. > > If you want a small adventure, I think a significant KVM-related > performance improvement is available. Specifically, on VMX exits, the > GDT limit is hardwired to 0xffff (IIRC -- I could be remembering the > actual vaue wrong). KVM does LGDT to fix it. > > If we actually made the GDT have limit 0xffff (presumably by mapping > the zero page a few times to pad it out without wasting memory), then > we would avoid the LGDT. LGDT is incredibly slow, so this would be a > big win. Want to see if you can make this work with your patch set? I can always take a look. If you have any prototype or more details, feel free to send it to me on a separate thread.
On 1/5/2017 8:40 AM, Thomas Garnier wrote: > Well, it happens only when KASLR memory randomization is enabled. Do > you think it should have a separate config option? no I would want it a runtime option.... "sgdt from ring 3" is going away with UMIP (and is already possibly gone in virtual machines, see https://lwn.net/Articles/694385/) and for those cases it would be a shame to lose the randomization
On 1/5/2017 9:54 AM, Thomas Garnier wrote: > > That's my goal too. I started by doing a RO remap and got couple > problems with hibernation. I can try again for the next iteration or > delay it for another patch. I also need to look at KVM GDT usage, I am > not familiar with it yet. don't we write to the GDT as part of the TLS segment stuff for glibc ?
On Thu, Jan 5, 2017 at 10:56 AM, Arjan van de Ven <arjan@linux.intel.com> wrote: > On 1/5/2017 8:40 AM, Thomas Garnier wrote: >> >> Well, it happens only when KASLR memory randomization is enabled. Do >> you think it should have a separate config option? > > > no I would want it a runtime option.... "sgdt from ring 3" is going away > with UMIP (and is already possibly gone in virtual machines, see > https://lwn.net/Articles/694385/) and for those cases it would be a shame > to lose the randomization > That's correct. When UMIP is enabled, we should disable fixed location for both GDT and IDT. Glad to do that when UMIP support is added.
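A hypothetical sketch of the runtime check being asked for; X86_FEATURE_UMIP did not exist in the kernel at this point, so the flag name below is an assumption about future UMIP support:

#include <asm/cpufeature.h>

/*
 * Assumed-future check: when UMIP already hides SGDT/SIDT from ring 3,
 * keep the GDT at its randomized per-cpu location instead of a fixed
 * remapping.
 */
static bool gdt_fixed_remap_wanted(void)
{
	return !boot_cpu_has(X86_FEATURE_UMIP);
}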
On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote: > On 1/5/2017 9:54 AM, Thomas Garnier wrote: > >> >> That's my goal too. I started by doing a RO remap and got couple >> problems with hibernation. I can try again for the next iteration or >> delay it for another patch. I also need to look at KVM GDT usage, I am >> not familiar with it yet. > > > don't we write to the GDT as part of the TLS segment stuff for glibc ? > Not sure which glibc feature it is. In this design, you can still write to the GDT through the per-cpu variable, which remains read-write. You just need to make the remapping writeable when we load the task register (ltr), because the processor then writes through the current GDT address. That is the only case I know of so far; I might find more through testing.
On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote: > On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote: >> On 1/5/2017 9:54 AM, Thomas Garnier wrote: >> >>> >>> That's my goal too. I started by doing a RO remap and got couple >>> problems with hibernation. I can try again for the next iteration or >>> delay it for another patch. I also need to look at KVM GDT usage, I am >>> not familiar with it yet. >> >> >> don't we write to the GDT as part of the TLS segment stuff for glibc ? >> > > Not sure which glibc feature it is. > > In this design, you can write to the GDT per-cpu variable that will > remain read-write. You just need to make the remapping writeable when > we load task registers (ltr) then the processor use the current GDT > address. At least that the case I know, I might find more through > testing. Hmm. I bet that if we preset the accessed bits in all the segments then we don't need it to be writable in general. But your point about set_thread_area (TLS) is well taken. However, I strongly suspect that we could make set_thread_area unconditionally set the accessed bit and no one would ever notice.
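A sketch of what presetting the accessed bit could look like in the descriptor-building path (not the patch's code; fill_ldt() is the existing helper used by the TLS/LDT code, and bit 0 of the descriptor type field is the accessed bit):

#include <asm/desc.h>

/*
 * Sketch: build the TLS descriptor as usual, then force the accessed
 * bit so the CPU never needs to write the GDT entry when the segment
 * is loaded.
 */
static void fill_tls_desc_preset_accessed(struct desc_struct *desc,
					  const struct user_desc *info)
{
	fill_ldt(desc, info);	/* existing helper that builds the descriptor */
	desc->type |= 1;	/* force the accessed bit */
}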
On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: > On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote: >> On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote: >>> On 1/5/2017 9:54 AM, Thomas Garnier wrote: >>> >>>> >>>> That's my goal too. I started by doing a RO remap and got couple >>>> problems with hibernation. I can try again for the next iteration or >>>> delay it for another patch. I also need to look at KVM GDT usage, I am >>>> not familiar with it yet. >>> >>> >>> don't we write to the GDT as part of the TLS segment stuff for glibc ? >>> >> >> Not sure which glibc feature it is. >> >> In this design, you can write to the GDT per-cpu variable that will >> remain read-write. You just need to make the remapping writeable when >> we load task registers (ltr) then the processor use the current GDT >> address. At least that the case I know, I might find more through >> testing. > > Hmm. I bet that if we preset the accessed bits in all the segments > then we don't need it to be writable in general. But your point about > set_thread_area (TLS) is well taken. However, I strongly suspect that > we could make set_thread_area unconditionally set the accessed bit and > no one would ever notice. Not sure I fully understood and I don't want to miss an important point. Do you mean making GDT (remapping and per-cpu) read-only and switch the writeable flag only when we write to the per-cpu entry?
On Thu, Jan 5, 2017 at 1:08 PM, Thomas Garnier <thgarnie@google.com> wrote: > On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: >> On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote: >>> On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote: >>>> On 1/5/2017 9:54 AM, Thomas Garnier wrote: >>>> >>>>> >>>>> That's my goal too. I started by doing a RO remap and got couple >>>>> problems with hibernation. I can try again for the next iteration or >>>>> delay it for another patch. I also need to look at KVM GDT usage, I am >>>>> not familiar with it yet. >>>> >>>> >>>> don't we write to the GDT as part of the TLS segment stuff for glibc ? >>>> >>> >>> Not sure which glibc feature it is. >>> >>> In this design, you can write to the GDT per-cpu variable that will >>> remain read-write. You just need to make the remapping writeable when >>> we load task registers (ltr) then the processor use the current GDT >>> address. At least that the case I know, I might find more through >>> testing. >> >> Hmm. I bet that if we preset the accessed bits in all the segments >> then we don't need it to be writable in general. But your point about >> set_thread_area (TLS) is well taken. However, I strongly suspect that >> we could make set_thread_area unconditionally set the accessed bit and >> no one would ever notice. > > Not sure I fully understood and I don't want to miss an important > point. Do you mean making GDT (remapping and per-cpu) read-only and > switch the writeable flag only when we write to the per-cpu entry? > What I mean is: write to the GDT through normal percpu access (or whatever the normal mapping is) but load a read-only alias into the GDT register. As long as nothing ever tries to write through the GDTR alias, no page faults will be generated. So we just need to make sure that nothing ever writes to it through GDTR. AFAIK the only reason the CPU ever writes to the address in GDTR is to set an accessed bit. --Andy
On Thu, Jan 5, 2017 at 1:19 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Thu, Jan 5, 2017 at 1:08 PM, Thomas Garnier <thgarnie@google.com> wrote: >> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: >>> On Thu, Jan 5, 2017 at 11:03 AM, Thomas Garnier <thgarnie@google.com> wrote: >>>> On Thu, Jan 5, 2017 at 10:58 AM, Arjan van de Ven <arjan@linux.intel.com> wrote: >>>>> On 1/5/2017 9:54 AM, Thomas Garnier wrote: >>>>> >>>>>> >>>>>> That's my goal too. I started by doing a RO remap and got couple >>>>>> problems with hibernation. I can try again for the next iteration or >>>>>> delay it for another patch. I also need to look at KVM GDT usage, I am >>>>>> not familiar with it yet. >>>>> >>>>> >>>>> don't we write to the GDT as part of the TLS segment stuff for glibc ? >>>>> >>>> >>>> Not sure which glibc feature it is. >>>> >>>> In this design, you can write to the GDT per-cpu variable that will >>>> remain read-write. You just need to make the remapping writeable when >>>> we load task registers (ltr) then the processor use the current GDT >>>> address. At least that the case I know, I might find more through >>>> testing. >>> >>> Hmm. I bet that if we preset the accessed bits in all the segments >>> then we don't need it to be writable in general. But your point about >>> set_thread_area (TLS) is well taken. However, I strongly suspect that >>> we could make set_thread_area unconditionally set the accessed bit and >>> no one would ever notice. >> >> Not sure I fully understood and I don't want to miss an important >> point. Do you mean making GDT (remapping and per-cpu) read-only and >> switch the writeable flag only when we write to the per-cpu entry? >> > > What I mean is: write to the GDT through normal percpu access (or > whatever the normal mapping is) but load a read-only alias into the > GDT register. As long as nothing ever tries to write through the GDTR > alias, no page faults will be generated. So we just need to make sure > that nothing ever writes to it through GDTR. AFAIK the only reason > the CPU ever writes to the address in GDTR is to set an accessed bit. > A write is made when we use load_TR_desc (ltr). I didn't see any other yet. > --Andy
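For reference, load_TR_desc() boils down to the LTR instruction (reproduced here from memory rather than quoted from the tree):

/*
 * LTR loads the TSS selector and marks the descriptor busy, which is
 * the one write through whatever address GDTR currently holds.
 */
static inline void native_load_tr_desc(void)
{
	asm volatile("ltr %w0" : : "q" (GDT_ENTRY_TSS * 8));
}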
On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: > > Hmm. I bet that if we preset the accessed bits in all the segments > then we don't need it to be writable in general. I'm not sure that this is architecturally safe. IIRC, we do mark the IDT read-only - but that one we started doing due to the f00f bug, so we knew it was ok. I'm not sure you can do the same with the GDT/LDT. Linus
On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> Hmm. I bet that if we preset the accessed bits in all the segments >> then we don't need it to be writable in general. > > I'm not sure that this is architecturally safe. > > IIRC, we do mark the IDT read-only - but that one we started doing due > to the f00f bug, so we knew it was ok. I'm not sure you can do the > same with the GDT/LDT. > I started testing a variant that makes the GDT remapping read-only by default and writeable only for LTR. Everything works fine, even hibernation. I still need to do more testing on different architectures, though. To be on the safe side, I could split the read-only part into a separate patch so we can easily remove it if extended testing shows a problem. > Linus
On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> Hmm. I bet that if we preset the accessed bits in all the segments >> then we don't need it to be writable in general. > > I'm not sure that this is architecturally safe. > Hmm. Last time I looked, I couldn't find *anything* in the SDM explaining what happened if a GDT access resulted in a page fault. I did discover that Xen intentionally (!) lazily populates and maps LDT pages. An attempt to access a not-present page results in #PF with the error code indicating kernel access even if the access came from user mode. SDM volume 3 7.2.2 says "Pages corresponding to the previous task’s TSS, the current task’s TSS, and the descriptor table entries for each all should be marked as read/write." But I don't see how a CPU implementation could possibly care what the page table for the TSS descriptor table entries says after LTR is done because the CPU isn't even supposed to *read* that memory. OTOH a valid implementation could easily require that the page table says that the page is writable merely to load a segment, especially in weird cases (IRET?). That being said, this is all quite easy to test. Also, Thomas, why are you creating a new memory region? I don't see any benefit to randomizing the GDT address. How about just putting it in the fixmap? This would be NR_CPUS * 4 pages if we do my limit=0xffff idea. I'm not sure if the fixmap code knows how to handle this much space.
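A rough sketch of the fixmap variant (the FIX_GDT_REMAP_BEGIN range and the helper names are assumptions, not part of this patch; the range would need one enum fixed_addresses slot per possible CPU):

#include <linux/percpu.h>
#include <asm/desc.h>
#include <asm/fixmap.h>

/* Sketch: map each CPU's GDT read-only at a fixed (fixmap) address. */
static void setup_fixmap_gdt(int cpu)
{
	phys_addr_t phys = per_cpu_ptr_to_phys(get_cpu_gdt_table(cpu));

	__set_fixmap(FIX_GDT_REMAP_BEGIN + cpu, phys, PAGE_KERNEL_RO);
}

static struct desc_struct *get_cpu_fixmap_gdt(int cpu)
{
	return (struct desc_struct *)__fix_to_virt(FIX_GDT_REMAP_BEGIN + cpu);
}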
* Thomas Garnier <thgarnie@google.com> wrote: > > Thanks, > > > > Ingo > > Ingo: I saw the 5-level page table support being sent through. Do you > want me to wait for it to be -next? (Given it will need to be changed > too). Please just base your bits on Linus's latest tree - we'll sort out any conflicts as/when they happen. Thanks, Ingo
* Arjan van de Ven <arjan@linux.intel.com> wrote: > On 1/5/2017 9:54 AM, Thomas Garnier wrote: > > > That's my goal too. I started by doing a RO remap and got couple problems with > > hibernation. I can try again for the next iteration or delay it for another > > patch. I also need to look at KVM GDT usage, I am not familiar with it yet. > > don't we write to the GDT as part of the TLS segment stuff for glibc ?

Yes - but the idea would be to have two virtual memory aliases:

1) A world-readable RO mapping of the GDT put into the fixmap area (randomization is not really required as the SGDT instruction exposes it anyway). This read-only GDT mapping has two advantages:

   - Because the GDT address is read-only it cannot be used by exploits as an easy trampoline for 100% safe, deterministic rooting of the system from a local process.

   - Even on older CPUs the SGDT instruction won't expose the absolute address of critical kernel data structures.

2) A kernel-only RW mapping in the linear kernel memory map where the original page is - randomized (as a natural part of kernel physical and virtual memory randomization) and only known and accessible to the kernel; SGDT doesn't expose it even on older CPUs.

This way we IMHO have the best of both worlds. Assuming there are no strange CPU errata surfacing if we use a RO GDT - OTOH that should be fairly easy to test for, and worst-case we could quirk-off the feature on CPUs where this cannot be done safely.

Thanks,

Ingo
* Thomas Garnier <thgarnie@google.com> wrote: > >> Not sure I fully understood and I don't want to miss an important point. Do > >> you mean making GDT (remapping and per-cpu) read-only and switch the > >> writeable flag only when we write to the per-cpu entry? > > > > What I mean is: write to the GDT through normal percpu access (or whatever the > > normal mapping is) but load a read-only alias into the GDT register. As long > > as nothing ever tries to write through the GDTR alias, no page faults will be > > generated. So we just need to make sure that nothing ever writes to it > > through GDTR. AFAIK the only reason the CPU ever writes to the address in > > GDTR is to set an accessed bit. > > A write is made when we use load_TR_desc (ltr). I didn't see any other yet. Is this write to the GDT, generated by the LTR instruction, done unconditionally by the hardware? Thanks, Ingo
On Thu, Jan 05, 2017 at 08:40:29AM -0800, Thomas Garnier wrote: > > kernel_unrandomize_smp() ... > > > > That seems like a better name. Hardly... I'd call it something like kaslr_load_gdt() to actually say what this function is doing.
On Thu, Jan 5, 2017 at 6:34 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: >>> >>> Hmm. I bet that if we preset the accessed bits in all the segments >>> then we don't need it to be writable in general. >> >> I'm not sure that this is architecturally safe. >> > > Hmm. Last time I looked, I couldn't find *anything* in the SDM > explaining what happened if a GDT access resulted in a page fault. I > did discover that Xen intentionally (!) lazily populates and maps LDT > pages. An attempt to access a not-present page results in #PF with > the error cod e indicating kernel access even if the access came from > user mode. > > SDM volume 3 7.2.2 says "Pages corresponding to the previous task’s > TSS, the current task’s TSS, and the descriptor table entries for > each all should be marked as read/write." But I don't see how a CPU > implementation could possibly care what the page table for the TSS > descriptor table entries says after LTR is done because the CPU isn't > even supposed to *read* that memory. > > OTOH a valid implementation could easily require that the page table > says that the page is writable merely to load a segment, especially in > weird cases (IRET?). That being said, this is all quite easy to test. > > Also, Thomas, why are you creating a new memory region? I don't see > any benefit to randomizing the GDT address. How about just putting it > in the fixmap? This would be NR_CPUS * 4 pages if do my limit=0xffff > idea. I'm not sure if the fixmap code knows how to handle this much > space. When I looked at the fixmap, you had to define the space you need ahead of time and I am not sure there was enough space as you said.
On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote: > > * Thomas Garnier <thgarnie@google.com> wrote: > >> >> Not sure I fully understood and I don't want to miss an important point. Do >> >> you mean making GDT (remapping and per-cpu) read-only and switch the >> >> writeable flag only when we write to the per-cpu entry? >> > >> > What I mean is: write to the GDT through normal percpu access (or whatever the >> > normal mapping is) but load a read-only alias into the GDT register. As long >> > as nothing ever tries to write through the GDTR alias, no page faults will be >> > generated. So we just need to make sure that nothing ever writes to it >> > through GDTR. AFAIK the only reason the CPU ever writes to the address in >> > GDTR is to set an accessed bit. >> >> A write is made when we use load_TR_desc (ltr). I didn't see any other yet. > > Is this write to the GDT, generated by the LTR instruction, done unconditionally > by the hardware? > That was my experience. I didn't look into details. Do you think we could change something so that ltr never writes to the GDT? (just mark the TSS entry busy). > Thanks, > > Ingo
On Fri, Jan 6, 2017 at 9:44 AM, Borislav Petkov <bp@alien8.de> wrote: > On Thu, Jan 05, 2017 at 08:40:29AM -0800, Thomas Garnier wrote: >> > kernel_unrandomize_smp() ... >> > >> >> That seems like a better name. > > Hardly... I'd call it something like kaslr_load_gdt() to actually say > what this function is doing. > True, it loads it only for the main processor though. I will try a variant like kaslr_load_main_gdt(). > -- > Regards/Gruss, > Boris. > > Good mailing practices for 400: avoid top-posting and trim the reply.
On Fri, Jan 6, 2017 at 10:02 AM, Thomas Garnier <thgarnie@google.com> wrote: > On Thu, Jan 5, 2017 at 6:34 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> On Thu, Jan 5, 2017 at 3:05 PM, Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> On Thu, Jan 5, 2017 at 12:18 PM, Andy Lutomirski <luto@kernel.org> wrote: >>>> >>>> Hmm. I bet that if we preset the accessed bits in all the segments >>>> then we don't need it to be writable in general. >>> >>> I'm not sure that this is architecturally safe. >>> >> >> Hmm. Last time I looked, I couldn't find *anything* in the SDM >> explaining what happened if a GDT access resulted in a page fault. I >> did discover that Xen intentionally (!) lazily populates and maps LDT >> pages. An attempt to access a not-present page results in #PF with >> the error cod e indicating kernel access even if the access came from >> user mode. >> >> SDM volume 3 7.2.2 says "Pages corresponding to the previous task’s >> TSS, the current task’s TSS, and the descriptor table entries for >> each all should be marked as read/write." But I don't see how a CPU >> implementation could possibly care what the page table for the TSS >> descriptor table entries says after LTR is done because the CPU isn't >> even supposed to *read* that memory. >> >> OTOH a valid implementation could easily require that the page table >> says that the page is writable merely to load a segment, especially in >> weird cases (IRET?). That being said, this is all quite easy to test. >> >> Also, Thomas, why are you creating a new memory region? I don't see >> any benefit to randomizing the GDT address. How about just putting it >> in the fixmap? This would be NR_CPUS * 4 pages if do my limit=0xffff >> idea. I'm not sure if the fixmap code knows how to handle this much >> space. > > When I looked at the fixmap, you had to define the space you need > ahead of time and I am not sure there was enough space as you said. Can you try it and see if anything goes wrong? Even if something does go wrong, I think we should fix *that* rather than making the memory layout more complicated.
On Fri, Jan 6, 2017 at 10:03 AM, Thomas Garnier <thgarnie@google.com> wrote: > On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote: >> >> * Thomas Garnier <thgarnie@google.com> wrote: >> >>> >> Not sure I fully understood and I don't want to miss an important point. Do >>> >> you mean making GDT (remapping and per-cpu) read-only and switch the >>> >> writeable flag only when we write to the per-cpu entry? >>> > >>> > What I mean is: write to the GDT through normal percpu access (or whatever the >>> > normal mapping is) but load a read-only alias into the GDT register. As long >>> > as nothing ever tries to write through the GDTR alias, no page faults will be >>> > generated. So we just need to make sure that nothing ever writes to it >>> > through GDTR. AFAIK the only reason the CPU ever writes to the address in >>> > GDTR is to set an accessed bit. >>> >>> A write is made when we use load_TR_desc (ltr). I didn't see any other yet. >> >> Is this write to the GDT, generated by the LTR instruction, done unconditionally >> by the hardware? >> > > That was my experience. I didn't look into details. Do you think we > could change something so that ltr never writes to the GDT? (just mark > the TSS entry busy). No, and I had the way this worked on 64-bit wrong. LTR requires an available TSS and changes it to busy. So here are my thoughts on how this should work: Let's get rid of any connection between this code and KASLR. Every time KASLR makes something work differently, a kitten turns all Schrödinger on us. This is moving the GDT to the fixmap, plain and simple. For now, make it one page per CPU and don't worry about the GDT limit. On 32-bit, we're going to have to make the fixmap GDT be read-write because making it read-only will break double-fault handling. On 64-bit, we can use your trick of temporarily mapping the GDT read-write every time we load TR, which should happen very rarely. Alternatively, we can reload the *GDT* every time we reload TR, which should be comparably slow. This is going to regress performance in the extremely rare case where KVM exits to a process that uses ioperm() (I think), but I doubt anyone cares. Or maybe we could arrange to never reload TR when GDT points at the fixmap by having KVM set the host GDT to the direct version and letting KVM's code that reloads the GDT switch to the fixmap copy. If we need a quirk to keep the fixmap copy read-write, so be it. None of this should depend on KASLR. IMO it should happen unconditionally. Once all of it works, then we can build on it to allocate four pages per CPU (with the extra three pointing to the zero page) and speed up KVM. --Andy
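A sketch of the 64-bit option, under the assumption of a fixmap-alias helper like the one sketched earlier (get_cpu_fixmap_gdt() is an assumed name, not an existing function at this point):

/*
 * Sketch: run normally with GDTR pointing at the read-only alias, and
 * hop to the writable per-cpu GDT only around the LTR that sets the
 * TSS busy bit.
 */
static void load_tr_with_ro_gdt(int cpu)
{
	struct desc_ptr rw_gdt = {
		.address = (unsigned long)get_cpu_gdt_table(cpu),
		.size    = GDT_SIZE - 1,
	};
	struct desc_ptr ro_gdt = {
		.address = (unsigned long)get_cpu_fixmap_gdt(cpu),
		.size    = GDT_SIZE - 1,
	};

	load_gdt(&rw_gdt);	/* writable mapping for the busy-bit write */
	load_TR_desc();		/* LTR */
	load_gdt(&ro_gdt);	/* back to the read-only alias */
}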
On Fri, Jan 6, 2017 at 1:59 PM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Jan 6, 2017 at 10:03 AM, Thomas Garnier <thgarnie@google.com> wrote: >> On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote: >>> >>> * Thomas Garnier <thgarnie@google.com> wrote: >>> >>>> >> Not sure I fully understood and I don't want to miss an important point. Do >>>> >> you mean making GDT (remapping and per-cpu) read-only and switch the >>>> >> writeable flag only when we write to the per-cpu entry? >>>> > >>>> > What I mean is: write to the GDT through normal percpu access (or whatever the >>>> > normal mapping is) but load a read-only alias into the GDT register. As long >>>> > as nothing ever tries to write through the GDTR alias, no page faults will be >>>> > generated. So we just need to make sure that nothing ever writes to it >>>> > through GDTR. AFAIK the only reason the CPU ever writes to the address in >>>> > GDTR is to set an accessed bit. >>>> >>>> A write is made when we use load_TR_desc (ltr). I didn't see any other yet. >>> >>> Is this write to the GDT, generated by the LTR instruction, done unconditionally >>> by the hardware? >>> >> >> That was my experience. I didn't look into details. Do you think we >> could change something so that ltr never writes to the GDT? (just mark >> the TSS entry busy). > > No, and I had the way this worked on 64-bit wrong. LTR requires an > available TSS and changes it to busy. So here are my thoughts on how > this should work: > > Let's get rid of any connection between this code and KASLR. Every > time KASLR makes something work differently, a kitten turns all > Schrödinger on us. This is moving the GDT to the fixmap, plain and > simple. For now, make it one page per CPU and don't worry about the > GDT limit. I am all for this change but that's more significant. Ingo: What do you think about that? > > On 32-bit, we're going to have to make the fixmap GDT be read-write > because making it read-only will break double-fault handling. > > On 64-bit, we can use your trick of temporarily mapping the GDT > read-write every time we load TR, which should happen very rarely. > Alternatively, we can reload the *GDT* every time we reload TR, which > should be comparably slow. This is going to regress performance in > the extremely rare case where KVM exits to a process that uses > ioperm() (I think), but I doubt anyone cares. Or maybe we could > arrange to never reload TR when GDT points at the fixmap by having KVM > set the host GDT to the direct version and letting KVM's code to > reload the GDT switch to the fixmap copy. > > If we need a quirk to keep the fixmap copy read-write, so be it. > > None of this should depend on KASLR. IMO it should happen unconditionally. > I looked back at the fixmap, and I can see a way it could be done (using NR_CPUS) like the other fixmap ranges. It would limit the number of cpus to 512 (there is 2M memory left on fixmap on the default configuration). That's if we never add any other fixmap on x64. I don't know if it is an acceptable number and if the fixmap region could be increased. (128 if we do your kvm trick, of course). Ingo: What do you think? > Once all if it works, then we can build on it to allocate four pages > per CPU (with the extra three pointing to the zero page) and speeding > up KVM. > > --Andy
On Fri, Jan 6, 2017 at 2:54 PM, Thomas Garnier <thgarnie@google.com> wrote: > On Fri, Jan 6, 2017 at 1:59 PM, Andy Lutomirski <luto@kernel.org> wrote: >> On Fri, Jan 6, 2017 at 10:03 AM, Thomas Garnier <thgarnie@google.com> wrote: >>> On Thu, Jan 5, 2017 at 10:49 PM, Ingo Molnar <mingo@kernel.org> wrote: >>>> >>>> * Thomas Garnier <thgarnie@google.com> wrote: >>>> >>>>> >> Not sure I fully understood and I don't want to miss an important point. Do >>>>> >> you mean making GDT (remapping and per-cpu) read-only and switch the >>>>> >> writeable flag only when we write to the per-cpu entry? >>>>> > >>>>> > What I mean is: write to the GDT through normal percpu access (or whatever the >>>>> > normal mapping is) but load a read-only alias into the GDT register. As long >>>>> > as nothing ever tries to write through the GDTR alias, no page faults will be >>>>> > generated. So we just need to make sure that nothing ever writes to it >>>>> > through GDTR. AFAIK the only reason the CPU ever writes to the address in >>>>> > GDTR is to set an accessed bit. >>>>> >>>>> A write is made when we use load_TR_desc (ltr). I didn't see any other yet. >>>> >>>> Is this write to the GDT, generated by the LTR instruction, done unconditionally >>>> by the hardware? >>>> >>> >>> That was my experience. I didn't look into details. Do you think we >>> could change something so that ltr never writes to the GDT? (just mark >>> the TSS entry busy). >> >> No, and I had the way this worked on 64-bit wrong. LTR requires an >> available TSS and changes it to busy. So here are my thoughts on how >> this should work: >> >> Let's get rid of any connection between this code and KASLR. Every >> time KASLR makes something work differently, a kitten turns all >> Schrödinger on us. This is moving the GDT to the fixmap, plain and >> simple. For now, make it one page per CPU and don't worry about the >> GDT limit. > > I am all for this change but that's more significant. > > Ingo: What do you think about that? > >> >> On 32-bit, we're going to have to make the fixmap GDT be read-write >> because making it read-only will break double-fault handling. >> >> On 64-bit, we can use your trick of temporarily mapping the GDT >> read-write every time we load TR, which should happen very rarely. >> Alternatively, we can reload the *GDT* every time we reload TR, which >> should be comparably slow. This is going to regress performance in >> the extremely rare case where KVM exits to a process that uses >> ioperm() (I think), but I doubt anyone cares. Or maybe we could >> arrange to never reload TR when GDT points at the fixmap by having KVM >> set the host GDT to the direct version and letting KVM's code to >> reload the GDT switch to the fixmap copy. >> >> If we need a quirk to keep the fixmap copy read-write, so be it. >> >> None of this should depend on KASLR. IMO it should happen unconditionally. >> > > I looked back at the fixmap, and I can see a way it could be done > (using NR_CPUS) like the other fixmap ranges. It would limit the > number of cpus to 512 (there is 2M memory left on fixmap on the > default configuration). That's if we never add any other fixmap on > x64. I don't know if it is an acceptable number and if the fixmap > region could be increased. (128 if we do your kvm trick, of course). > IIRC we need 4096 CPUs. But that 2M limit seems eminently fixable. I just tried sticking 4096 pages of nothing right near the top of the fixmap and the only problem I saw was that I had to move MODULES_END down a little bit. --Andy P.S. 
Let's do the move to the fixmap, read/write as a separate patch. That will make bisecting much easier.
* Thomas Garnier <thgarnie@google.com> wrote: > > No, and I had the way this worked on 64-bit wrong. LTR requires an > > available TSS and changes it to busy. So here are my thoughts on how > > this should work: > > > > Let's get rid of any connection between this code and KASLR. Every > > time KASLR makes something work differently, a kitten turns all > > Schrödinger on us. This is moving the GDT to the fixmap, plain and > > simple. For now, make it one page per CPU and don't worry about the > > GDT limit. > > I am all for this change but that's more significant. > > Ingo: What do you think about that? I agree with Andy: as I alluded to earlier as well this should be an unconditional change (tested properly, etc.) that robustifies the GDT mapping for everyone. That KASLR kernels improve too is a happy side effect! > > On 32-bit, we're going to have to make the fixmap GDT be read-write because > > making it read-only will break double-fault handling. > > > > On 64-bit, we can use your trick of temporarily mapping the GDT read-write > > every time we load TR, which should happen very rarely. Alternatively, we can > > reload the *GDT* every time we reload TR, which should be comparably slow. > > This is going to regress performance in the extremely rare case where KVM > > exits to a process that uses ioperm() (I think), but I doubt anyone cares. Or > > maybe we could arrange to never reload TR when GDT points at the fixmap by > > having KVM set the host GDT to the direct version and letting KVM's code to > > reload the GDT switch to the fixmap copy. Please check whether the LTR write generates a page fault to a RO PTE even if the busy bit is already set. LTR is pretty slow which suggests that it's microcode, and microcode is usually not sloppy about such things: i.e. LTR would only generate an unconditional write if there's a compatibility dependency on it. But I could easily be wrong ... > > If we need a quirk to keep the fixmap copy read-write, so be it. > > > > None of this should depend on KASLR. IMO it should happen unconditionally. > > I looked back at the fixmap, and I can see a way it could be done > (using NR_CPUS) like the other fixmap ranges. It would limit the > number of cpus to 512 (there is 2M memory left on fixmap on the > default configuration). That's if we never add any other fixmap on > x64. I don't know if it is an acceptable number and if the fixmap > region could be increased. (128 if we do your kvm trick, of course). > > Ingo: What do you think? I think we should scale the fixmap size flexibly with NR_CPUs on 64-bit, and we should limit CPUs on 32-bit to a reasonable value. I.e. let's just do it, if we run into problems it's all solvable AFAICS. Thanks, Ingo
* Andy Lutomirski <luto@kernel.org> wrote: > > I looked back at the fixmap, and I can see a way it could be done (using > > NR_CPUS) like the other fixmap ranges. It would limit the number of cpus to > > 512 (there is 2M memory left on fixmap on the default configuration). That's > > if we never add any other fixmap on x64. I don't know if it is an acceptable > > number and if the fixmap region could be increased. (128 if we do your kvm > > trick, of course). > > IIRC we need 4096 CPUs. On 64-bit the limit is 8192 CPUs, and the SGI guys are relying on that up to the tune of 6144 cores already, and I'd say the 64-bit CPU count is likely to go up further with 5-level paging. On 32-bit the reasonable CPU limit is the number that the Intel 32-bit cluster computing nodes use. The latest public numbers are I think 36 'tiles' with each tile being a 2-CPU SMT core - i.e. a limit of 72 CPUs has to be maintained. (They'll obviously go to 64-bit as well so this problem will go away in a hardware generation or two.) So I'd say 128 CPUs on 32-bit should be a reasonable practical limit going forward. Right now our 32-bit limit is 512 CPUs IIRC, but I don't think any real hardware in production is reaching that. > P.S. Let's do the move to the fixmap, read/write as a separate patch. That will > make bisecting much easier. Absolutely, but this has to be within the same series, as the interim fixmap-only step is less secure in some circumstances: we are moving the writable GDT from a previously randomized location to a fixed location. Thanks, Ingo
* Andy Lutomirski <luto@kernel.org> wrote: > > When I looked at the fixmap, you had to define the space you need ahead of > > time and I am not sure there was enough space as you said. > > Can you try it and see if anything goes wrong? Even if something does go wrong, > I think we should fix *that* rather than making the memory layout more > complicated. Absolutely! This should always be the driving principle when complicating the kernel's memory layout. Thanks, Ingo
On Fri, Jan 6, 2017 at 11:45 PM, Ingo Molnar <mingo@kernel.org> wrote: > > * Andy Lutomirski <luto@kernel.org> wrote: >> P.S. Let's do the move to the fixmap, read/write as a separate patch. That will >> make bisecting much easier. > > Absolutely, but this has to be within the same series, as the interim fixmap-only > step is less secure in some circumstances: we are moving the writable GDT from a > previously randomized location to a fixed location. True, but despite being randomized its location was never even remotely secret. (Except on Kaby Lake or Foobar Lake or whatever CPU that is.) --Andy
On Fri, Jan 6, 2017 at 11:35 PM, Ingo Molnar <mingo@kernel.org> wrote: > > * Thomas Garnier <thgarnie@google.com> wrote: > >> > No, and I had the way this worked on 64-bit wrong. LTR requires an >> > available TSS and changes it to busy. So here are my thoughts on how >> > this should work: >> > >> > Let's get rid of any connection between this code and KASLR. Every >> > time KASLR makes something work differently, a kitten turns all >> > Schrödinger on us. This is moving the GDT to the fixmap, plain and >> > simple. For now, make it one page per CPU and don't worry about the >> > GDT limit. >> >> I am all for this change but that's more significant. >> >> Ingo: What do you think about that? > > I agree with Andy: as I alluded to earlier as well this should be an unconditional > change (tested properly, etc.) that robustifies the GDT mapping for everyone. That > KASLR kernels improve too is a happy side effect! > >> > On 32-bit, we're going to have to make the fixmap GDT be read-write because >> > making it read-only will break double-fault handling. >> > >> > On 64-bit, we can use your trick of temporarily mapping the GDT read-write >> > every time we load TR, which should happen very rarely. Alternatively, we can >> > reload the *GDT* every time we reload TR, which should be comparably slow. >> > This is going to regress performance in the extremely rare case where KVM >> > exits to a process that uses ioperm() (I think), but I doubt anyone cares. Or >> > maybe we could arrange to never reload TR when GDT points at the fixmap by >> > having KVM set the host GDT to the direct version and letting KVM's code to >> > reload the GDT switch to the fixmap copy. > > Please check whether the LTR write generates a page fault to a RO PTE even if the > busy bit is already set. LTR is pretty slow which suggests that it's microcode, > and microcode is usually not sloppy about such things: i.e. LTR would only > generate an unconditional write if there's a compatibility dependency on it. But I > could easily be wrong ... The SDM says: IF segment descriptor is not for an available TSS THEN #GP(segment selector); FI; so I think it's #GP not #PF. > >> > If we need a quirk to keep the fixmap copy read-write, so be it. >> > >> > None of this should depend on KASLR. IMO it should happen unconditionally. >> >> I looked back at the fixmap, and I can see a way it could be done >> (using NR_CPUS) like the other fixmap ranges. It would limit the >> number of cpus to 512 (there is 2M memory left on fixmap on the >> default configuration). That's if we never add any other fixmap on >> x64. I don't know if it is an acceptable number and if the fixmap >> region could be increased. (128 if we do your kvm trick, of course). >> >> Ingo: What do you think? > > I think we should scale the fixmap size flexibly with NR_CPUs on 64-bit, and we > should limit CPUs on 32-bit to a reasonable value. Unless the headers are even more tangled than usual, I think we could just handle this exactly. The top and bottom of the fixmap are known exactly at compile time, so we should just be able to make the next mm range adjust its start point accordingly. The main issue right now seems to be that MODULES_END is hard-coded. For 32-bit, the task gate in #DF is going to be a show-stopper for an RO fixmap, I think. I'm still slightly in favor of enabling the code but with an RW mapping on 32-bit, though. And anyone running hundreds to thousands of CPUs on 32-bit is nuts anyway. --Andy
On Fri, Jan 6, 2017 at 11:35 PM, Ingo Molnar <mingo@kernel.org> wrote: > > * Thomas Garnier <thgarnie@google.com> wrote: > >> > No, and I had the way this worked on 64-bit wrong. LTR requires an >> > available TSS and changes it to busy. So here are my thoughts on how >> > this should work: >> > >> > Let's get rid of any connection between this code and KASLR. Every >> > time KASLR makes something work differently, a kitten turns all >> > Schrödinger on us. This is moving the GDT to the fixmap, plain and >> > simple. For now, make it one page per CPU and don't worry about the >> > GDT limit. >> >> I am all for this change but that's more significant. >> >> Ingo: What do you think about that? > > I agree with Andy: as I alluded to earlier as well this should be an unconditional > change (tested properly, etc.) that robustifies the GDT mapping for everyone. That > KASLR kernels improve too is a happy side effect! > >> > On 32-bit, we're going to have to make the fixmap GDT be read-write because >> > making it read-only will break double-fault handling. >> > >> > On 64-bit, we can use your trick of temporarily mapping the GDT read-write >> > every time we load TR, which should happen very rarely. Alternatively, we can >> > reload the *GDT* every time we reload TR, which should be comparably slow. >> > This is going to regress performance in the extremely rare case where KVM >> > exits to a process that uses ioperm() (I think), but I doubt anyone cares. Or >> > maybe we could arrange to never reload TR when GDT points at the fixmap by >> > having KVM set the host GDT to the direct version and letting KVM's code to >> > reload the GDT switch to the fixmap copy. > > Please check whether the LTR write generates a page fault to a RO PTE even if the > busy bit is already set. LTR is pretty slow which suggests that it's microcode, > and microcode is usually not sloppy about such things: i.e. LTR would only > generate an unconditional write if there's a compatibility dependency on it. But I > could easily be wrong ... >

Coming back on this after a bit more testing. The LTR instruction checks whether the busy bit is already set; if it is, LTR just issues a #GP, as if it had been given a bad selector:

[ 0.000000] general protection fault: 0040 [#1] SMP
...
[ 0.000000] RIP: 0010:native_load_tr_desc+0x9/0x10
...
[ 0.000000] Call Trace:
[ 0.000000]  cpu_init+0x2d0/0x3c0
[ 0.000000]  trap_init+0x2a2/0x312
[ 0.000000]  start_kernel+0x1fb/0x43b
[ 0.000000]  ? set_init_arg+0x55/0x55
[ 0.000000]  ? early_idt_handler_array+0x120/0x120
[ 0.000000]  x86_64_start_reservations+0x2a/0x2c
[ 0.000000]  x86_64_start_kernel+0x13d/0x14c
[ 0.000000]  start_cpu+0x14/0x14

I assume that's in this part of the pseudo-code:

if(!IsWithinDescriptorTableLimit(Source.Offset) || Source.Type != TypeGlobal)
        Exception(GP(SegmentSelector));
SegmentDescriptor = ReadSegmentDescriptor();
if(!IsForAnAvailableTSS(SegmentDescriptor))
        Exception(GP(SegmentSelector));   <---- That's where I got the #GP
TSSSegmentDescriptor.Busy = 1;            <---- That's the page fault I get otherwise
//Locked read-modify-write operation on the entire descriptor when setting busy flag
TaskRegister.SegmentSelector = Source;
TaskRegister.SegmentDescriptor.TSSSegmentDescriptor;

I assume the best option would be to make the remap read-write for the LTR instruction. What do you think?

>> > If we need a quirk to keep the fixmap copy read-write, so be it. >> > >> > None of this should depend on KASLR. IMO it should happen unconditionally.
>> >> I looked back at the fixmap, and I can see a way it could be done >> (using NR_CPUS) like the other fixmap ranges. It would limit the >> number of cpus to 512 (there is 2M memory left on fixmap on the >> default configuration). That's if we never add any other fixmap on >> x64. I don't know if it is an acceptable number and if the fixmap >> region could be increased. (128 if we do your kvm trick, of course). >> >> Ingo: What do you think? > > I think we should scale the fixmap size flexibly with NR_CPUs on 64-bit, and we > should limit CPUs on 32-bit to a reasonable value. > > I.e. let's just do it, if we run into problems it's all solvable AFAICS. > > Thanks, > > Ingo
* Thomas Garnier <thgarnie@google.com> wrote: > Coming back on that after a bit more testing. The LTR instruction > check if the busy bit is already set, if already set then it will just > issue a #GP given a bad selector: > > [ 0.000000] general protection fault: 0040 [#1] SMP > ... > [ 0.000000] RIP: 0010:native_load_tr_desc+0x9/0x10 > ... > [ 0.000000] Call Trace: > [ 0.000000] cpu_init+0x2d0/0x3c0 > [ 0.000000] trap_init+0x2a2/0x312 > [ 0.000000] start_kernel+0x1fb/0x43b > [ 0.000000] ? set_init_arg+0x55/0x55 > [ 0.000000] ? early_idt_handler_array+0x120/0x120 > [ 0.000000] x86_64_start_reservations+0x2a/0x2c > [ 0.000000] x86_64_start_kernel+0x13d/0x14c > [ 0.000000] start_cpu+0x14/0x14 > > I assume that's in this part of the pseudo-code: > > if(!IsWithinDescriptorTableLimit(Source.Offset) || Source.Type != > TypeGlobal) Exception(GP(SegmentSelector)); > SegmentDescriptor = ReadSegmentDescriptor(); > if(!IsForAnAvailableTSS(SegmentDescriptor)) > Exception(GP(SegmentSelector)); <---- That's where I got the GP > TSSSegmentDescriptor.Busy = 1; > <------------------------------------------------------------------ > That's the pagefault I get otherwise > //Locked read-modify-write operation on the entire descriptor when > setting busy flag > TaskRegister.SegmentSelector = Source; > TaskRegister.SegmentDescriptor.TSSSegmentDescriptor; > > I assume the best option would be to make the remap read-write for the > LTR instruction. What do you think? So if LTR does not modify the GDT if the busy bit is already set, why don't we set the busy bit in the descriptor (via the linear mapping rw alias). Then the remapped GDT can stay read-only all the time and LTR won't fault. Am I missing something here? Thanks, Ingo
On Tue, Jan 10, 2017 at 2:27 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Thomas Garnier <thgarnie@google.com> wrote:
>
>> Coming back on that after a bit more testing. The LTR instruction checks
>> whether the busy bit is already set; if it is, it just issues a #GP as it
>> would for a bad selector:
>>
>> [    0.000000] general protection fault: 0040 [#1] SMP
>> ...
>> [    0.000000] RIP: 0010:native_load_tr_desc+0x9/0x10
>> ...
>> [    0.000000] Call Trace:
>> [    0.000000]  cpu_init+0x2d0/0x3c0
>> [    0.000000]  trap_init+0x2a2/0x312
>> [    0.000000]  start_kernel+0x1fb/0x43b
>> [    0.000000]  ? set_init_arg+0x55/0x55
>> [    0.000000]  ? early_idt_handler_array+0x120/0x120
>> [    0.000000]  x86_64_start_reservations+0x2a/0x2c
>> [    0.000000]  x86_64_start_kernel+0x13d/0x14c
>> [    0.000000]  start_cpu+0x14/0x14
>>
>> I assume that's in this part of the pseudo-code:
>>
>> if(!IsWithinDescriptorTableLimit(Source.Offset) || Source.Type != TypeGlobal)
>>         Exception(GP(SegmentSelector));
>> SegmentDescriptor = ReadSegmentDescriptor();
>> if(!IsForAnAvailableTSS(SegmentDescriptor))
>>         Exception(GP(SegmentSelector));  <---- That's where I got the #GP
>> TSSSegmentDescriptor.Busy = 1;           <---- That's the page fault I get otherwise
>> //Locked read-modify-write operation on the entire descriptor when setting busy flag
>> TaskRegister.SegmentSelector = Source;
>> TaskRegister.SegmentDescriptor = TSSSegmentDescriptor;
>>
>> I assume the best option would be to make the remap read-write for the
>> LTR instruction. What do you think?
>
> So if LTR does not modify the GDT when the busy bit is already set, why don't we
> set the busy bit in the descriptor (via the linear mapping rw alias)?
>
> Then the remapped GDT can stay read-only all the time and LTR won't fault.
>
> Am I missing something here?
>

Sorry, I may not have explained myself well. If you set the busy bit on the
GDT TSS entry beforehand, you get a #GP on LTR, because LTR requires an
*available* TSS. So when we use the LTR instruction, the GDT it points to
needs to be writeable.

I think I can handle it by switching to the remapped GDT after
load_TR_desc. I will try this approach.

> Thanks,
>
> Ingo
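The ordering described above would look roughly like the sketch below:
load TR while the writable per-cpu GDT is active, so LTR can set the busy
bit, and only then point GDTR at the fixed alias. This is a sketch of the
idea, not the code that was posted; load_fixed_gdt_after_tr() is a made-up
name, while kaslr_get_gdt_remap() is the helper from this series and the
rest is existing kernel API:

#include <asm/desc.h>
#include <asm/processor.h>
#include <asm/kaslr.h>

/* Hypothetical helper showing the proposed order of operations. */
static void load_fixed_gdt_after_tr(int cpu, struct tss_struct *t)
{
	struct desc_ptr gdt_descr;
	struct desc_struct *remap = kaslr_get_gdt_remap(cpu);

	set_tss_desc(cpu, t);	/* writes the writable per-cpu GDT */
	load_TR_desc();		/* LTR marks the TSS busy through it */

	if (!remap)
		return;		/* no remapping: keep the per-cpu GDT */

	gdt_descr.address = (unsigned long)remap;
	gdt_descr.size = GDT_SIZE - 1;
	load_gdt(&gdt_descr);	/* SGDT now exposes only the fixed alias */
}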
diff --git a/arch/x86/include/asm/kaslr.h b/arch/x86/include/asm/kaslr.h
index 1052a797d71d..babc32803182 100644
--- a/arch/x86/include/asm/kaslr.h
+++ b/arch/x86/include/asm/kaslr.h
@@ -9,8 +9,12 @@ extern unsigned long vmalloc_base;
 extern unsigned long vmemmap_base;
 
 void kernel_randomize_memory(void);
+void kernel_randomize_smp(void);
+void* kaslr_get_gdt_remap(int cpu);
 #else
 static inline void kernel_randomize_memory(void) { }
+static inline void kernel_randomize_smp(void) { }
+static inline void *kaslr_get_gdt_remap(int cpu) { return NULL; }
 #endif /* CONFIG_RANDOMIZE_MEMORY */
 
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index dc1697ca5191..2c8a7b4718ea 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -450,8 +450,13 @@ void load_percpu_segment(int cpu)
 void switch_to_new_gdt(int cpu)
 {
 	struct desc_ptr gdt_descr;
+	struct desc_struct *gdt;
 
-	gdt_descr.address = (long)get_cpu_gdt_table(cpu);
+	gdt = kaslr_get_gdt_remap(cpu);
+	if (!gdt)
+		gdt = get_cpu_gdt_table(cpu);
+
+	gdt_descr.address = (long)gdt;
 	gdt_descr.size = GDT_SIZE - 1;
 	load_gdt(&gdt_descr);
 	/* Reload the per-cpu base */
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ea9c49adaa1f..213fe01f28dc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -50,6 +50,9 @@ enum address_markers_idx {
 	LOW_KERNEL_NR,
 	VMALLOC_START_NR,
 	VMEMMAP_START_NR,
+# ifdef CONFIG_RANDOMIZE_MEMORY
+	GDT_REMAP_NR,
+# endif
 # ifdef CONFIG_X86_ESPFIX64
 	ESPFIX_START_NR,
 # endif
@@ -75,6 +78,9 @@ static struct addr_marker address_markers[] = {
 	{ 0/* PAGE_OFFSET */,   "Low Kernel Mapping" },
 	{ 0/* VMALLOC_START */, "vmalloc() Area" },
 	{ 0/* VMEMMAP_START */, "Vmemmap" },
+# ifdef CONFIG_RANDOMIZE_MEMORY
+	{ 0, "GDT remapping" },
+# endif
 # ifdef CONFIG_X86_ESPFIX64
 	{ ESPFIX_BASE_ADDR,	"ESPfix Area", 16 },
 # endif
@@ -442,6 +448,10 @@ static int __init pt_dump_init(void)
 	address_markers[LOW_KERNEL_NR].start_address = PAGE_OFFSET;
 	address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
 	address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+#ifdef CONFIG_RANDOMIZE_MEMORY
+	address_markers[GDT_REMAP_NR].start_address =
+		(unsigned long) kaslr_get_gdt_remap(0);
+#endif
 #endif
 #ifdef CONFIG_X86_32
 	address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 887e57182716..db1bdb75f8af 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -22,11 +22,13 @@
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/random.h>
+#include <linux/slab.h>
 
 #include <asm/pgalloc.h>
 #include <asm/pgtable.h>
 #include <asm/setup.h>
 #include <asm/kaslr.h>
+#include <asm/desc.h>
 
 #include "mm_internal.h"
 
@@ -60,6 +62,7 @@ unsigned long vmalloc_base = __VMALLOC_BASE;
 EXPORT_SYMBOL(vmalloc_base);
 unsigned long vmemmap_base = __VMEMMAP_BASE;
 EXPORT_SYMBOL(vmemmap_base);
+unsigned long gdt_tables_base = 0;
 
 /*
  * Memory regions randomized by KASLR (except modules that use a separate logic
@@ -97,7 +100,7 @@ void __init kernel_randomize_memory(void)
 	unsigned long vaddr = vaddr_start;
 	unsigned long rand, memory_tb;
 	struct rnd_state rand_state;
-	unsigned long remain_entropy;
+	unsigned long remain_entropy, gdt_reserved;
 
 	/*
 	 * All these BUILD_BUG_ON checks ensures the memory layout is
@@ -131,6 +134,13 @@ void __init kernel_randomize_memory(void)
 	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
 		remain_entropy -= get_padding(&kaslr_regions[i]);
 
+	/* Reserve space for fixed GDTs, if we have enough available */
+	gdt_reserved = sizeof(struct gdt_page) * max(setup_max_cpus, 1U);
+	if (gdt_reserved < remain_entropy) {
+		gdt_tables_base = vaddr_end - gdt_reserved;
+		remain_entropy -= gdt_reserved;
+	}
+
 	prandom_seed_state(&rand_state, kaslr_get_random_long("Memory"));
 
 	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
@@ -192,3 +202,98 @@ void __meminit init_trampoline(void)
 	set_pgd(&trampoline_pgd_entry,
 		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+/* Hold the remapping of the gdt page for each cpu */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct desc_struct *, gdt_remap);
+
+/* Return the address where the GDT is remapped for this CPU */
+static unsigned long gdt_remap_address(int cpu)
+{
+	return gdt_tables_base + cpu * sizeof(struct gdt_page);
+}
+
+/* Remap the specified gdt table */
+static struct desc_struct *remap_gdt(int cpu)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	struct desc_struct *gdt;
+	unsigned long addr;
+
+	/* GDT table should be only one page */
+	BUILD_BUG_ON(sizeof(struct gdt_page) != PAGE_SIZE);
+
+	/* Keep the original GDT before the allocator is available */
+	if (!slab_is_available())
+		return NULL;
+
+	gdt = get_cpu_gdt_table(cpu);
+	addr = gdt_remap_address(cpu);
+
+	pgd = pgd_offset_k(addr);
+	pud = pud_alloc(&init_mm, pgd, addr);
+	if (WARN_ON(!pud))
+		return NULL;
+	pmd = pmd_alloc(&init_mm, pud, addr);
+	if (WARN_ON(!pmd))
+		return NULL;
+	pte = pte_alloc_kernel(pmd, addr);
+	if (WARN_ON(!pte))
+		return NULL;
+
+	/* If the PTE is already set, something is wrong with the VA ranges */
+	BUG_ON(!pte_none(*pte));
+
+	/* Remap the target GDT and return it */
+	set_pte_at(&init_mm, addr, pte,
+		   pfn_pte(PFN_DOWN(__pa(gdt)), PAGE_KERNEL));
+	gdt = (struct desc_struct *)addr;
+	per_cpu(gdt_remap, cpu) = gdt;
+	return gdt;
+}
+
+/* Check if GDT remapping is enabled */
+static bool kaslr_gdt_remap_enabled(void)
+{
+	return kaslr_memory_enabled() && gdt_tables_base != 0;
+}
+
+/*
+ * The GDT table address is available to user-mode through the sgdt
+ * instruction. This function will return a fixed remapping to load so you
+ * cannot leak the per-cpu structure address.
+ */
+void* kaslr_get_gdt_remap(int cpu)
+{
+	struct desc_struct *gdt_remapping;
+
+	if (!kaslr_gdt_remap_enabled())
+		return NULL;
+
+	gdt_remapping = per_cpu(gdt_remap, cpu);
+	if (!gdt_remapping)
+		gdt_remapping = remap_gdt(cpu);
+
+	return gdt_remapping;
+}
+
+/*
+ * Switch the first processor GDT to the remapping. The GDT is loaded too early
+ * to generate the remapping correctly. This step is done later at boot or
+ * before other processors come back from hibernation.
+ */
+void kernel_randomize_smp(void)
+{
+	struct desc_ptr gdt_descr;
+	struct desc_struct *gdt;
+
+	gdt = kaslr_get_gdt_remap(raw_smp_processor_id());
+	if (WARN_ON(!gdt))
+		return;
+
+	gdt_descr.address = (long)gdt;
+	gdt_descr.size = GDT_SIZE - 1;
+	load_gdt(&gdt_descr);
+}
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f75c4d031eeb..4d6979299b9a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1040,6 +1040,9 @@ void enable_nonboot_cpus(void)
 {
 	int cpu, error;
 
+	/* Redo KASLR steps for main processor */
+	kernel_randomize_smp();
+
 	/* Allow everyone to use the CPU hotplug again */
 	cpu_maps_update_begin();
 	__cpu_hotplug_enable();
diff --git a/kernel/smp.c b/kernel/smp.c
index 77fcdb9f2775..e1ef8d05e179 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -554,6 +554,7 @@ void __init smp_init(void)
 	idle_threads_init();
 	cpuhp_threads_init();
 
+	kernel_randomize_smp();
 	pr_info("Bringing up secondary CPUs ...\n");
Each processor holds a GDT in its per-cpu structure. The sgdt
instruction gives the base address of the current GDT. This address can
be used to bypass KASLR memory randomization. With another bug, an
attacker could target other per-cpu structures or deduce the base of the
main memory section (PAGE_OFFSET).

In this change, a space is reserved at the end of the memory range
available for KASLR memory randomization. The space is big enough to hold
one GDT page for the maximum number of CPUs (as defined by setup_max_cpus).
Each GDT is mapped at a specific offset based on the target CPU. Note that
if there is not enough space available, the GDTs are not remapped.

The documentation was changed to mention GDT remapping for KASLR. This
patch also includes page table dump support.

This patch was tested on multiple hardware configurations and for
hibernation support.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
---
Based on next-20170104
---
 arch/x86/include/asm/kaslr.h  |   4 ++
 arch/x86/kernel/cpu/common.c  |   7 ++-
 arch/x86/mm/dump_pagetables.c |  10 ++++
 arch/x86/mm/kaslr.c           | 107 +++++++++++++++++++++++++++++++++++++++++-
 kernel/cpu.c                  |   3 ++
 kernel/smp.c                  |   1 +
 6 files changed, 130 insertions(+), 2 deletions(-)
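For context, the leak this series closes is easy to demonstrate: on
hardware without UMIP, SGDT is not privileged, so an unprivileged process
can read the GDT base and learn where the per-cpu area lives unless the
GDT is aliased at a fixed address. A small illustrative user-space program
(not part of the patch) for 64-bit:

#include <stdint.h>
#include <stdio.h>

/* SGDT stores a 10-byte value in 64-bit mode: 2-byte limit, 8-byte base. */
struct __attribute__((packed)) gdt_ptr {
	uint16_t limit;
	uint64_t base;
};

int main(void)
{
	struct gdt_ptr gdtr;

	__asm__ volatile("sgdt %0" : "=m"(gdtr));
	printf("GDT base: 0x%016llx limit: %u\n",
	       (unsigned long long)gdtr.base, gdtr.limit);
	return 0;
}

With the remapping applied, the printed base is the fixed alias rather
than the randomized per-cpu address.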