Message ID | 1556228754-12996-6-git-send-email-rppt@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | x86: introduce system calls addess space isolation | expand |
On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote: > If a system call runs in isolated context, it's accesses to kernel code and > data will be verified by SCI susbsytem. > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > --- > arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++ > 1 file changed, 28 insertions(+) There's a distinct lack of touching do_double_fault(). It appears to me that you'll instantly trigger #DF when you #PF, because the #PF handler itself will not be able to run. And then obviously you have to be very careful to make sure #DF can, _at_all_times_ run, otherwise you'll tripple-fault and we all know what that does.
On Fri, Apr 26, 2019 at 09:42:23AM +0200, Peter Zijlstra wrote: > On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote: > > If a system call runs in isolated context, it's accesses to kernel code and > > data will be verified by SCI susbsytem. > > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > > --- > > arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++ > > 1 file changed, 28 insertions(+) > > There's a distinct lack of touching do_double_fault(). It appears to me > that you'll instantly trigger #DF when you #PF, because the #PF handler > itself will not be able to run. The #PF handler is able to run. On interrupt/error entry the cr3 is switched to the full kernel page tables, pretty much like PTI does for user <-> kernel transitions. It's in the patch 3. > And then obviously you have to be very careful to make sure #DF can, > _at_all_times_ run, otherwise you'll tripple-fault and we all know what > that does. >
On Sat, Apr 27, 2019 at 10:47 PM Mike Rapoport <rppt@linux.ibm.com> wrote: > > On Fri, Apr 26, 2019 at 09:42:23AM +0200, Peter Zijlstra wrote: > > On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote: > > > If a system call runs in isolated context, it's accesses to kernel code and > > > data will be verified by SCI susbsytem. > > > > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > > > --- > > > arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++ > > > 1 file changed, 28 insertions(+) > > > > There's a distinct lack of touching do_double_fault(). It appears to me > > that you'll instantly trigger #DF when you #PF, because the #PF handler > > itself will not be able to run. > > The #PF handler is able to run. On interrupt/error entry the cr3 is > switched to the full kernel page tables, pretty much like PTI does for > user <-> kernel transitions. It's in the patch 3. > > PeterZ meant page_fault, not do_page_fault. In your patch, page_fault and some of error_entry run before that magic switchover happens. If they're not in the page tables, you double-fault. And don't even try to do SCI magic in the double-fault handler. As I understand it, the SDM and APM aren't kidding when they say that #DF is an abort, not a fault. There is a single case in the kernel where we recover from #DF, and it was vetted by microcode people.
On Tue, Apr 30, 2019 at 09:44:09AM -0700, Andy Lutomirski wrote: > On Sat, Apr 27, 2019 at 10:47 PM Mike Rapoport <rppt@linux.ibm.com> wrote: > > > > On Fri, Apr 26, 2019 at 09:42:23AM +0200, Peter Zijlstra wrote: > > > On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote: > > > > If a system call runs in isolated context, it's accesses to kernel code and > > > > data will be verified by SCI susbsytem. > > > > > > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> > > > > --- > > > > arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++ > > > > 1 file changed, 28 insertions(+) > > > > > > There's a distinct lack of touching do_double_fault(). It appears to me > > > that you'll instantly trigger #DF when you #PF, because the #PF handler > > > itself will not be able to run. > > > > The #PF handler is able to run. On interrupt/error entry the cr3 is > > switched to the full kernel page tables, pretty much like PTI does for > > user <-> kernel transitions. It's in the patch 3. > > > > > > PeterZ meant page_fault, not do_page_fault. In your patch, page_fault > and some of error_entry run before that magic switchover happens. If > they're not in the page tables, you double-fault. The entry code is in sci page tables, just like in user-space page tables with PTI. > And don't even try to do SCI magic in the double-fault handler. As I > understand it, the SDM and APM aren't kidding when they say that #DF > is an abort, not a fault. There is a single case in the kernel where > we recover from #DF, and it was vetted by microcode people. >
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 9d5c75f..baa2a2f 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -18,6 +18,7 @@ #include <linux/uaccess.h> /* faulthandler_disabled() */ #include <linux/efi.h> /* efi_recover_from_page_fault()*/ #include <linux/mm_types.h> +#include <linux/sci.h> /* sci_verify_and_map() */ #include <asm/cpufeature.h> /* boot_cpu_has, ... */ #include <asm/traps.h> /* dotraplinkage, ... */ @@ -1254,6 +1255,30 @@ static int fault_in_kernel_space(unsigned long address) return address >= TASK_SIZE_MAX; } +#ifdef CONFIG_SYSCALL_ISOLATION +static int sci_fault(struct pt_regs *regs, unsigned long hw_error_code, + unsigned long address) +{ + struct task_struct *tsk = current; + + if (!tsk->in_isolated_syscall) + return 0; + + if (!sci_verify_and_map(regs, address, hw_error_code)) { + this_cpu_write(cpu_sci.sci_syscall, 0); + no_context(regs, hw_error_code, address, SIGKILL, 0); + } + + return 1; +} +#else +static inline int sci_fault(struct pt_regs *regs, unsigned long hw_error_code, + unsigned long address) +{ + return 0; +} +#endif + /* * Called for all faults where 'address' is part of the kernel address * space. Might get called for faults that originate from *code* that @@ -1301,6 +1326,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, if (kprobes_fault(regs)) return; + if (sci_fault(regs, hw_error_code, address)) + return; + /* * Note, despite being a "bad area", there are quite a few * acceptable reasons to get here, such as erratum fixups
If a system call runs in isolated context, it's accesses to kernel code and data will be verified by SCI susbsytem. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> --- arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+)