| Message ID | 20190117003259.23141-7-rick.p.edgecombe@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Merge text_poke fixes and executable lockdowns |
On Wed, Jan 16, 2019 at 4:33 PM Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> text_poke() can potentially compromise security as it sets temporary
> PTEs in the fixmap. These PTEs might be used to rewrite the kernel code
> from other cores accidentally or maliciously, if an attacker gains the
> ability to write onto kernel memory.

I think this may be sufficient, but barely.

> +	pte_clear(poking_mm, poking_addr, ptep);
> +
> +	/*
> +	 * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
> +	 * as it also flushes the corresponding "user" address spaces, which
> +	 * does not exist.
> +	 *
> +	 * Poking, however, is already very inefficient since it does not try to
> +	 * batch updates, so we ignore this problem for the time being.
> +	 *
> +	 * Since the PTEs do not exist in other kernel address-spaces, we do
> +	 * not use __flush_tlb_one_kernel(), which when PTI is on would cause
> +	 * more unwarranted TLB flushes.
> +	 *
> +	 * There is a slight anomaly here: the PTE is a supervisor-only and
> +	 * (potentially) global and we use __flush_tlb_one_user() but this
> +	 * should be fine.
> +	 */
> +	__flush_tlb_one_user(poking_addr);
> +	if (cross_page_boundary) {
> +		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
> +		__flush_tlb_one_user(poking_addr + PAGE_SIZE);
> +	}

In principle, another CPU could still have the old translation. Your
mutex probably makes this impossible, but it makes me nervous.
Ideally you'd use flush_tlb_mm_range(), but I guess you can't do that
with IRQs off. Hmm. I think you should add an inc_mm_tlb_gen() here.
Arguably, if you did that, you could omit the flushes, but maybe
that's silly.

If we start getting new users of use_temporary_mm(), we should give
some serious thought to the SMP semantics.

Also, you're using PAGE_KERNEL. Please tell me that the global bit
isn't set in there.

--Andy
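A minimal sketch of what the suggested inc_mm_tlb_gen() call could look like at this point in __text_poke(), using the names from the patch below; this is an illustration of the review comment, not code from the series:

```c
/*
 * Sketch only: after tearing down the poking PTEs, bump poking_mm's TLB
 * generation so that any CPU which later activates this mm notices its
 * cached translations are stale; keep the local flushes for this CPU.
 */
pte_clear(poking_mm, poking_addr, ptep);
if (cross_page_boundary)
	pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);

inc_mm_tlb_gen(poking_mm);

__flush_tlb_one_user(poking_addr);
if (cross_page_boundary)
	__flush_tlb_one_user(poking_addr + PAGE_SIZE);
```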
On Thu, Jan 17, 2019 at 12:27 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Jan 16, 2019 at 4:33 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> >
> > [ patch quoted above, trimmed ]
>
> In principle, another CPU could still have the old translation. Your
> mutex probably makes this impossible, but it makes me nervous.
> Ideally you'd use flush_tlb_mm_range(), but I guess you can't do that
> with IRQs off. Hmm. I think you should add an inc_mm_tlb_gen() here.
> Arguably, if you did that, you could omit the flushes, but maybe
> that's silly.
>
> If we start getting new users of use_temporary_mm(), we should give
> some serious thought to the SMP semantics.
>
> Also, you're using PAGE_KERNEL. Please tell me that the global bit
> isn't set in there.

Much better solution: do unuse_temporary_mm() and *then*
flush_tlb_mm_range(). This is entirely non-sketchy and should be just
about optimal, too.

--Andy
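Concretely, the proposed reordering would look roughly like this in __text_poke(); a sketch against the five-argument flush_tlb_mm_range() signature current at the time of this thread, not the code as merged:

```c
/* Clear the poking PTEs while still serialized by text_mutex. */
pte_clear(poking_mm, poking_addr, ptep);
if (cross_page_boundary)
	pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);

/* Switch back to the previous mm first: poking_mm is now unused ... */
unuse_temporary_mm(prev);

/*
 * ... so a ranged flush of the whole poking window cannot race with a
 * user of the temporary mm, and, since no CPU has poking_mm loaded,
 * needs no IPIs (see the kgdb/IRQs-off discussion later in the thread).
 */
flush_tlb_mm_range(poking_mm, poking_addr,
		   poking_addr + (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
		   PAGE_SHIFT, false);
```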
> On Jan 17, 2019, at 12:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> [ earlier quotes trimmed ]
>
> Much better solution: do unuse_temporary_mm() and *then*
> flush_tlb_mm_range(). This is entirely non-sketchy and should be just
> about optimal, too.

This solution sounds nice and clean. The fact that the global bit was set
didn't matter before (since __flush_tlb_one_user() would get rid of it no
matter what), but it would matter now, so I'll change that too.

Thanks!

Nadav
> On Jan 17, 2019, at 1:43 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>
> [ earlier quotes trimmed ]
>
> This solution sounds nice and clean. The fact that the global bit was set
> didn't matter before (since __flush_tlb_one_user() would get rid of it no
> matter what), but it would matter now, so I'll change that too.

Err... so actually text_poke() might be called with IRQs disabled (by
kgdb). flush_tlb_mm_range() should still work fine even with IRQs
disabled, since no core would be using poking_mm at that point. I can add
a comment to flush_tlb_mm_range(), but all in all it is actually not very
pretty.
On January 17, 2019 1:43:54 PM PST, Nadav Amit <nadav.amit@gmail.com> wrote:
>
> [ earlier quotes trimmed ]
>
> This solution sounds nice and clean. The fact that the global bit was set
> didn't matter before (since __flush_tlb_one_user() would get rid of it no
> matter what), but it would matter now, so I'll change that too.
>
> Thanks!
>
> Nadav

You can just disable the global bit at the top level, obviously. This
approach should also make it far easier to do batching if desired.
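One way to "disable the global bit at the top level" is to mask _PAGE_GLOBAL out of the protections used for the poking PTEs. A sketch of that idea; the pgprot variable is illustrative and not from the posted patch:

```c
/*
 * Sketch: never map the poking pages global, regardless of what
 * PAGE_KERNEL contains on this configuration, so the entries cannot
 * survive a CR3 switch in the TLB.
 */
pgprot_t pgprot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);

pte = mk_pte(pages[0], pgprot);
set_pte_at(poking_mm, poking_addr, ptep, pte);
if (cross_page_boundary) {
	pte = mk_pte(pages[1], pgprot);
	set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
}
```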
```diff
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 50ba74a34a37..9da8cccdf3fb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -103,8 +103,6 @@ enum fixed_addresses {
 #ifdef CONFIG_PARAVIRT
 	FIX_PARAVIRT_BOOTMAP,
 #endif
-	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
-	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
 #ifdef CONFIG_X86_INTEL_MID
 	FIX_LNW_VRTC,
 #endif
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 57fdde308bb6..8fc4685f3117 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -11,6 +11,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/mmu_context.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -683,41 +684,105 @@ __ro_after_init unsigned long poking_addr;
 
 static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
+	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
+	temporary_mm_state_t prev;
+	struct page *pages[2] = {NULL};
 	unsigned long flags;
-	char *vaddr;
-	struct page *pages[2];
-	int i;
+	pte_t pte, *ptep;
+	spinlock_t *ptl;
 
 	/*
-	 * While boot memory allocator is runnig we cannot use struct
-	 * pages as they are not yet initialized.
+	 * While boot memory allocator is running we cannot use struct pages as
+	 * they are not yet initialized.
 	 */
 	BUG_ON(!after_bootmem);
 
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
-		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
 	} else {
 		pages[0] = virt_to_page(addr);
 		WARN_ON(!PageReserved(pages[0]));
-		pages[1] = virt_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
-	BUG_ON(!pages[0]);
+	BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));
+
 	local_irq_save(flags);
-	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
-	if (pages[1])
-		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
-	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	clear_fixmap(FIX_TEXT_POKE0);
-	if (pages[1])
-		clear_fixmap(FIX_TEXT_POKE1);
-	local_flush_tlb();
-	sync_core();
-	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-	   that causes hangs on some VIA CPUs. */
-	for (i = 0; i < len; i++)
-		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
+
+	/*
+	 * The lock is not really needed, but this allows to avoid open-coding.
+	 */
+	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
+
+	/*
+	 * This must not fail; preallocated in poking_init().
+	 */
+	VM_BUG_ON(!ptep);
+
+	pte = mk_pte(pages[0], PAGE_KERNEL);
+	set_pte_at(poking_mm, poking_addr, ptep, pte);
+
+	if (cross_page_boundary) {
+		pte = mk_pte(pages[1], PAGE_KERNEL);
+		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+	}
+
+	/*
+	 * Loading the temporary mm behaves as a compiler barrier, which
+	 * guarantees that the PTE will be set at the time memcpy() is done.
+	 */
+	prev = use_temporary_mm(poking_mm);
+
+	kasan_disable_current();
+	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+	kasan_enable_current();
+
+	/*
+	 * Ensure that the PTE is only cleared after the instructions of memcpy
+	 * were issued by using a compiler barrier.
+	 */
+	barrier();
+
+	pte_clear(poking_mm, poking_addr, ptep);
+
+	/*
+	 * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
+	 * as it also flushes the corresponding "user" address spaces, which
+	 * does not exist.
+	 *
+	 * Poking, however, is already very inefficient since it does not try to
+	 * batch updates, so we ignore this problem for the time being.
+	 *
+	 * Since the PTEs do not exist in other kernel address-spaces, we do
+	 * not use __flush_tlb_one_kernel(), which when PTI is on would cause
+	 * more unwarranted TLB flushes.
+	 *
+	 * There is a slight anomaly here: the PTE is a supervisor-only and
+	 * (potentially) global and we use __flush_tlb_one_user() but this
+	 * should be fine.
+	 */
+	__flush_tlb_one_user(poking_addr);
+	if (cross_page_boundary) {
+		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
+		__flush_tlb_one_user(poking_addr + PAGE_SIZE);
+	}
+
+	/*
+	 * Loading the previous page-table hierarchy requires a serializing
+	 * instruction that already allows the core to see the updated version.
+	 * Xen-PV is assumed to serialize execution in a similar manner.
+	 */
+	unuse_temporary_mm(prev);
+
+	pte_unmap_unlock(ptep, ptl);
+	/*
+	 * If the text doesn't match what we just wrote; something is
+	 * fundamentally screwy, there's nothing we can really do about that.
+	 */
+	BUG_ON(memcmp(addr, opcode, len));
+
 	local_irq_restore(flags);
 	return addr;
 }
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 0f4fe206dcc2..82b181fcefe5 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2319,8 +2319,6 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #elif defined(CONFIG_X86_VSYSCALL_EMULATION)
 	case VSYSCALL_PAGE:
 #endif
-	case FIX_TEXT_POKE0:
-	case FIX_TEXT_POKE1:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
```
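The VM_BUG_ON(!ptep) above relies on poking_init() having preallocated the page-table levels for poking_addr; that function comes from elsewhere in this series. A rough sketch of the preallocation it refers to; the exact body, including copy_init_mm(), is an assumption based on the rest of the series, not part of this diff:

```c
/*
 * Assumed shape of poking_init(): create a dedicated mm for patching
 * and walk the page tables for poking_addr once at boot, so that
 * get_locked_pte() in __text_poke() cannot fail at patch time.
 */
void __init poking_init(void)
{
	spinlock_t *ptl;
	pte_t *ptep;

	poking_mm = copy_init_mm();
	BUG_ON(!poking_mm);

	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
	BUG_ON(!ptep);		/* allocates the intermediate levels */
	pte_unmap_unlock(ptep, ptl);
}
```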