[RFC,v2,00/27] Kernel Address Space Isolation
mbox series

Message ID 1562855138-19507-1-git-send-email-alexandre.chartre@oracle.com
Headers show
Series
  • Kernel Address Space Isolation
Related show

Message

Alexandre Chartre July 11, 2019, 2:25 p.m. UTC
Hi,

This is version 2 of the "KVM Address Space Isolation" RFC. The code
has been completely changed compared to v1 and it now provides a generic
kernel framework which provides Address Space Isolation; and KVM is now
a simple consumer of that framework. That's why the RFC title has been
changed from "KVM Address Space Isolation" to "Kernel Address Space
Isolation".

Kernel Address Space Isolation aims to use address spaces to isolate some
parts of the kernel (for example KVM) to prevent leaking sensitive data
between hyper-threads under speculative execution attacks. You can refer
to the first version of this RFC for more context:

   https://lkml.org/lkml/2019/5/13/515

The new code is still a proof of concept. It is much more stable than v1:
I am able to run a VM with a full OS (and also a nested VM) with multiple
vcpus. But it looks like there are still some corner cases which cause the
system to crash/hang.

I am looking for feedback about this new approach where address space
isolation is provided by the kernel, and KVM is a just a consumer of this
new framework.


Changes
=======

- Address Space Isolation (ASI) is now provided as a kernel framework:
  interfaces for creating and managing an ASI are provided by the kernel,
  there are not implemented in KVM.

- An ASI is associated with a page-table, we don't use mm anymore. Entering
  isolation is done by just updating CR3 to use the ASI page-table. Exiting
  isolation restores CR3 with the CR3 value present before entering isolation.

- Isolation is exited at the beginning of any interrupt/exception handler,
  and on context switch.

- Isolation doesn't disable interrupt, but if an interrupt occurs the
  interrupt handler will exit isolation.

- The current stack is mapped when entering isolation and unmapped when
  exiting isolation.

- The current task is not mapped by default, but there's an option to map it.
  In such a case, the current task is mapped when entering isolation and
  unmap when exiting isolation.

- Kernel code mapped to the ASI page-table has been reduced to:
  . the entire kernel (I still need to test with only the kernel text)
  . the cpu entry area (because we need the GDT to be mapped)
  . the cpu ASI session (for managing ASI)
  . the current stack

- Optionally, an ASI can request the following kernel mapping to be added:
  . the stack canary
  . the cpu offsets (this_cpu_off)
  . the current task
  . RCU data (rcu_data)
  . CPU HW events (cpu_hw_events).

  All these optional mappings are used for KVM isolation.
  

Patches:
========

The proposed patches provides a framework for creating an Address Space
Isolation (ASI) (represented by a struct asi). The ASI has a page-table which
can be populated by copying mappings from the kernel page-table. The ASI can
then be entered/exited by switching between the kernel page-table and the
ASI page-table. In addition, any interrupt, exception or context switch
will automatically abort and exit the isolation. Finally patches use the
ASI framework to implement KVM isolation.

- 01-03: Core of the ASI framework: create/destroy ASI, enter/exit/abort
  isolation, ASI page-fault handler.

- 04-14: Functions to manage, populate and clear an ASI page-table.

- 15-20: ASI core mappings and optional mappings.

- 21: Make functions to read cr3/cr4 ASI aware

- 22-26: Use ASI in KVM to provide isolation for VMExit handlers.


API Overview:
=============
Here is a short description of the main ASI functions provided by the framwork.

struct asi *asi_create(int map_flags)

  Create an Address Space Isolation (ASI). map_flags can be used to specify
  optional kernel mapping to be added to the ASI page-table (for example,
  ASI_MAP_STACK_CANARY to map the stack canary).


void asi_destroy(struct asi *asi)

  Destroy an ASI.


int asi_enter(struct asi *asi)

  Enter isolation for the specified ASI. This switches from the kernel page-table
  to the page-table associated with the ASI.


void asi_exit(struct asi *asi)

  Exit isolation for the specified ASI. This switches back to the kernel
  page-table


int asi_map(struct asi *asi, void *ptr, unsigned long size);

  Copy kernel mapping to the specified ASI page-table.


void asi_unmap(struct asi *asi, void *ptr);

  Clear kernel mapping from the specified ASI page-table.


----
Alexandre Chartre (23):
  mm/x86: Introduce kernel address space isolation
  mm/asi: Abort isolation on interrupt, exception and context switch
  mm/asi: Handle page fault due to address space isolation
  mm/asi: Functions to track buffers allocated for an ASI page-table
  mm/asi: Add ASI page-table entry offset functions
  mm/asi: Add ASI page-table entry allocation functions
  mm/asi: Add ASI page-table entry set functions
  mm/asi: Functions to populate an ASI page-table from a VA range
  mm/asi: Helper functions to map module into ASI
  mm/asi: Keep track of VA ranges mapped in ASI page-table
  mm/asi: Functions to clear ASI page-table entries for a VA range
  mm/asi: Function to copy page-table entries for percpu buffer
  mm/asi: Add asi_remap() function
  mm/asi: Handle ASI mapped range leaks and overlaps
  mm/asi: Initialize the ASI page-table with core mappings
  mm/asi: Option to map current task into ASI
  rcu: Move tree.h static forward declarations to tree.c
  rcu: Make percpu rcu_data non-static
  mm/asi: Add option to map RCU data
  mm/asi: Add option to map cpu_hw_events
  mm/asi: Make functions to read cr3/cr4 ASI aware
  KVM: x86/asi: Populate the KVM ASI page-table
  KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI

Liran Alon (3):
  KVM: x86/asi: Introduce address_space_isolation module parameter
  KVM: x86/asi: Introduce KVM address space isolation
  KVM: x86/asi: Switch to KVM address space on entry to guest

 arch/x86/entry/entry_64.S          |   42 ++-
 arch/x86/include/asm/asi.h         |  237 ++++++++
 arch/x86/include/asm/mmu_context.h |   20 +-
 arch/x86/include/asm/tlbflush.h    |   10 +
 arch/x86/kernel/asm-offsets.c      |    4 +
 arch/x86/kvm/Makefile              |    3 +-
 arch/x86/kvm/mmu.c                 |    2 +-
 arch/x86/kvm/vmx/isolation.c       |  231 ++++++++
 arch/x86/kvm/vmx/vmx.c             |   14 +-
 arch/x86/kvm/vmx/vmx.h             |   24 +
 arch/x86/kvm/x86.c                 |   68 +++-
 arch/x86/kvm/x86.h                 |    1 +
 arch/x86/mm/Makefile               |    2 +
 arch/x86/mm/asi.c                  |  459 +++++++++++++++
 arch/x86/mm/asi_pagetable.c        | 1077 ++++++++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                |    7 +
 include/linux/kvm_host.h           |    7 +
 kernel/rcu/tree.c                  |   56 ++-
 kernel/rcu/tree.h                  |   56 +--
 kernel/sched/core.c                |    4 +
 security/Kconfig                   |   10 +
 21 files changed, 2269 insertions(+), 65 deletions(-)
 create mode 100644 arch/x86/include/asm/asi.h
 create mode 100644 arch/x86/kvm/vmx/isolation.c
 create mode 100644 arch/x86/mm/asi.c
 create mode 100644 arch/x86/mm/asi_pagetable.c

Comments

Alexandre Chartre July 11, 2019, 2:40 p.m. UTC | #1
And I've just noticed that I've messed up the subject of the cover letter.
There are 26 patches, not 27. So it should have been 00/26 not 00/27.

Sorry about that.

alex.

On 7/11/19 4:25 PM, Alexandre Chartre wrote:
> Hi,
> 
> This is version 2 of the "KVM Address Space Isolation" RFC. The code
> has been completely changed compared to v1 and it now provides a generic
> kernel framework which provides Address Space Isolation; and KVM is now
> a simple consumer of that framework. That's why the RFC title has been
> changed from "KVM Address Space Isolation" to "Kernel Address Space
> Isolation".
> 
> Kernel Address Space Isolation aims to use address spaces to isolate some
> parts of the kernel (for example KVM) to prevent leaking sensitive data
> between hyper-threads under speculative execution attacks. You can refer
> to the first version of this RFC for more context:
> 
>     https://lkml.org/lkml/2019/5/13/515
> 
> The new code is still a proof of concept. It is much more stable than v1:
> I am able to run a VM with a full OS (and also a nested VM) with multiple
> vcpus. But it looks like there are still some corner cases which cause the
> system to crash/hang.
> 
> I am looking for feedback about this new approach where address space
> isolation is provided by the kernel, and KVM is a just a consumer of this
> new framework.
> 
> 
> Changes
> =======
> 
> - Address Space Isolation (ASI) is now provided as a kernel framework:
>    interfaces for creating and managing an ASI are provided by the kernel,
>    there are not implemented in KVM.
> 
> - An ASI is associated with a page-table, we don't use mm anymore. Entering
>    isolation is done by just updating CR3 to use the ASI page-table. Exiting
>    isolation restores CR3 with the CR3 value present before entering isolation.
> 
> - Isolation is exited at the beginning of any interrupt/exception handler,
>    and on context switch.
> 
> - Isolation doesn't disable interrupt, but if an interrupt occurs the
>    interrupt handler will exit isolation.
> 
> - The current stack is mapped when entering isolation and unmapped when
>    exiting isolation.
> 
> - The current task is not mapped by default, but there's an option to map it.
>    In such a case, the current task is mapped when entering isolation and
>    unmap when exiting isolation.
> 
> - Kernel code mapped to the ASI page-table has been reduced to:
>    . the entire kernel (I still need to test with only the kernel text)
>    . the cpu entry area (because we need the GDT to be mapped)
>    . the cpu ASI session (for managing ASI)
>    . the current stack
> 
> - Optionally, an ASI can request the following kernel mapping to be added:
>    . the stack canary
>    . the cpu offsets (this_cpu_off)
>    . the current task
>    . RCU data (rcu_data)
>    . CPU HW events (cpu_hw_events).
> 
>    All these optional mappings are used for KVM isolation.
>    
> 
> Patches:
> ========
> 
> The proposed patches provides a framework for creating an Address Space
> Isolation (ASI) (represented by a struct asi). The ASI has a page-table which
> can be populated by copying mappings from the kernel page-table. The ASI can
> then be entered/exited by switching between the kernel page-table and the
> ASI page-table. In addition, any interrupt, exception or context switch
> will automatically abort and exit the isolation. Finally patches use the
> ASI framework to implement KVM isolation.
> 
> - 01-03: Core of the ASI framework: create/destroy ASI, enter/exit/abort
>    isolation, ASI page-fault handler.
> 
> - 04-14: Functions to manage, populate and clear an ASI page-table.
> 
> - 15-20: ASI core mappings and optional mappings.
> 
> - 21: Make functions to read cr3/cr4 ASI aware
> 
> - 22-26: Use ASI in KVM to provide isolation for VMExit handlers.
> 
> 
> API Overview:
> =============
> Here is a short description of the main ASI functions provided by the framwork.
> 
> struct asi *asi_create(int map_flags)
> 
>    Create an Address Space Isolation (ASI). map_flags can be used to specify
>    optional kernel mapping to be added to the ASI page-table (for example,
>    ASI_MAP_STACK_CANARY to map the stack canary).
> 
> 
> void asi_destroy(struct asi *asi)
> 
>    Destroy an ASI.
> 
> 
> int asi_enter(struct asi *asi)
> 
>    Enter isolation for the specified ASI. This switches from the kernel page-table
>    to the page-table associated with the ASI.
> 
> 
> void asi_exit(struct asi *asi)
> 
>    Exit isolation for the specified ASI. This switches back to the kernel
>    page-table
> 
> 
> int asi_map(struct asi *asi, void *ptr, unsigned long size);
> 
>    Copy kernel mapping to the specified ASI page-table.
> 
> 
> void asi_unmap(struct asi *asi, void *ptr);
> 
>    Clear kernel mapping from the specified ASI page-table.
> 
> 
> ----
> Alexandre Chartre (23):
>    mm/x86: Introduce kernel address space isolation
>    mm/asi: Abort isolation on interrupt, exception and context switch
>    mm/asi: Handle page fault due to address space isolation
>    mm/asi: Functions to track buffers allocated for an ASI page-table
>    mm/asi: Add ASI page-table entry offset functions
>    mm/asi: Add ASI page-table entry allocation functions
>    mm/asi: Add ASI page-table entry set functions
>    mm/asi: Functions to populate an ASI page-table from a VA range
>    mm/asi: Helper functions to map module into ASI
>    mm/asi: Keep track of VA ranges mapped in ASI page-table
>    mm/asi: Functions to clear ASI page-table entries for a VA range
>    mm/asi: Function to copy page-table entries for percpu buffer
>    mm/asi: Add asi_remap() function
>    mm/asi: Handle ASI mapped range leaks and overlaps
>    mm/asi: Initialize the ASI page-table with core mappings
>    mm/asi: Option to map current task into ASI
>    rcu: Move tree.h static forward declarations to tree.c
>    rcu: Make percpu rcu_data non-static
>    mm/asi: Add option to map RCU data
>    mm/asi: Add option to map cpu_hw_events
>    mm/asi: Make functions to read cr3/cr4 ASI aware
>    KVM: x86/asi: Populate the KVM ASI page-table
>    KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI
> 
> Liran Alon (3):
>    KVM: x86/asi: Introduce address_space_isolation module parameter
>    KVM: x86/asi: Introduce KVM address space isolation
>    KVM: x86/asi: Switch to KVM address space on entry to guest
> 
>   arch/x86/entry/entry_64.S          |   42 ++-
>   arch/x86/include/asm/asi.h         |  237 ++++++++
>   arch/x86/include/asm/mmu_context.h |   20 +-
>   arch/x86/include/asm/tlbflush.h    |   10 +
>   arch/x86/kernel/asm-offsets.c      |    4 +
>   arch/x86/kvm/Makefile              |    3 +-
>   arch/x86/kvm/mmu.c                 |    2 +-
>   arch/x86/kvm/vmx/isolation.c       |  231 ++++++++
>   arch/x86/kvm/vmx/vmx.c             |   14 +-
>   arch/x86/kvm/vmx/vmx.h             |   24 +
>   arch/x86/kvm/x86.c                 |   68 +++-
>   arch/x86/kvm/x86.h                 |    1 +
>   arch/x86/mm/Makefile               |    2 +
>   arch/x86/mm/asi.c                  |  459 +++++++++++++++
>   arch/x86/mm/asi_pagetable.c        | 1077 ++++++++++++++++++++++++++++++++++++
>   arch/x86/mm/fault.c                |    7 +
>   include/linux/kvm_host.h           |    7 +
>   kernel/rcu/tree.c                  |   56 ++-
>   kernel/rcu/tree.h                  |   56 +--
>   kernel/sched/core.c                |    4 +
>   security/Kconfig                   |   10 +
>   21 files changed, 2269 insertions(+), 65 deletions(-)
>   create mode 100644 arch/x86/include/asm/asi.h
>   create mode 100644 arch/x86/kvm/vmx/isolation.c
>   create mode 100644 arch/x86/mm/asi.c
>   create mode 100644 arch/x86/mm/asi_pagetable.c
>
Dave Hansen July 11, 2019, 10:38 p.m. UTC | #2
On 7/11/19 7:25 AM, Alexandre Chartre wrote:
> - Kernel code mapped to the ASI page-table has been reduced to:
>   . the entire kernel (I still need to test with only the kernel text)
>   . the cpu entry area (because we need the GDT to be mapped)
>   . the cpu ASI session (for managing ASI)
>   . the current stack
> 
> - Optionally, an ASI can request the following kernel mapping to be added:
>   . the stack canary
>   . the cpu offsets (this_cpu_off)
>   . the current task
>   . RCU data (rcu_data)
>   . CPU HW events (cpu_hw_events).

I don't see the per-cpu areas in here.  But, the ASI macros in
entry_64.S (and asi_start_abort()) use per-cpu data.

Also, this stuff seems to do naughty stuff (calling C code, touching
per-cpu data) before the PTI CR3 writes have been done.  But, I don't
see anything excluding PTI and this code from coexisting.
Alexandre Chartre July 12, 2019, 8:09 a.m. UTC | #3
On 7/12/19 12:38 AM, Dave Hansen wrote:
> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
>> - Kernel code mapped to the ASI page-table has been reduced to:
>>    . the entire kernel (I still need to test with only the kernel text)
>>    . the cpu entry area (because we need the GDT to be mapped)
>>    . the cpu ASI session (for managing ASI)
>>    . the current stack
>>
>> - Optionally, an ASI can request the following kernel mapping to be added:
>>    . the stack canary
>>    . the cpu offsets (this_cpu_off)
>>    . the current task
>>    . RCU data (rcu_data)
>>    . CPU HW events (cpu_hw_events).
> 
> I don't see the per-cpu areas in here.  But, the ASI macros in
> entry_64.S (and asi_start_abort()) use per-cpu data.

We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
is created (see patch 15/26):

+	/*
+	 * Map the percpu ASI sessions. This is used by interrupt handlers
+	 * to figure out if we have entered isolation and switch back to
+	 * the kernel address space.
+	 */
+	err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
+	if (err)
+		return err;


> Also, this stuff seems to do naughty stuff (calling C code, touching
> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
> see anything excluding PTI and this code from coexisting.

My understanding is that PTI CR3 writes only happens when switching to/from
userland. While ASI enter/exit/abort happens while we are already in the kernel,
so asi_start_abort() is not called when coming from userland and so not
interacting with PTI.

For example, if ASI in used during a syscall (e.g. with KVM), we have:

  -> syscall
     - PTI CR3 write (kernel CR3)
     - syscall handler:
       ...
       asi_enter()-> write ASI CR3
       .. code run with ASI ..
       asi_exit() or asi abort -> restore original CR3
       ...
     - PTI CR3 write (userland CR3)
  <- syscall


Thanks,

alex.
Thomas Gleixner July 12, 2019, 10:44 a.m. UTC | #4
On Thu, 11 Jul 2019, Dave Hansen wrote:

> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
> > - Kernel code mapped to the ASI page-table has been reduced to:
> >   . the entire kernel (I still need to test with only the kernel text)
> >   . the cpu entry area (because we need the GDT to be mapped)
> >   . the cpu ASI session (for managing ASI)
> >   . the current stack
> > 
> > - Optionally, an ASI can request the following kernel mapping to be added:
> >   . the stack canary
> >   . the cpu offsets (this_cpu_off)
> >   . the current task
> >   . RCU data (rcu_data)
> >   . CPU HW events (cpu_hw_events).
> 
> I don't see the per-cpu areas in here.  But, the ASI macros in
> entry_64.S (and asi_start_abort()) use per-cpu data.
> 
> Also, this stuff seems to do naughty stuff (calling C code, touching
> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
> see anything excluding PTI and this code from coexisting.

That ASI thing is just PTI on steroids.

So why do we need two versions of the same thing? That's absolutely bonkers
and will just introduce subtle bugs and conflicting decisions all over the
place.

The need for ASI is very tightly coupled to the need for PTI and there is
absolutely no point in keeping them separate.

The only difference vs. interrupts and exceptions is that the PTI logic
cares whether they enter from user or from kernel space while ASI only
cares about the kernel entry.

But most exceptions/interrupts transitions do not require to be handled at
the entry code level because on VMEXIT the exit reason clearly tells
whether a switch to the kernel CR3 is necessary or not. So this has to be
handled at the VMM level already in a very clean and simple way.

I'm not a virt wizard, but according to code inspection and instrumentation
even the NMI on the host is actually reinjected manually into the host via
'int $2' after the VMEXIT and for MCE it looks like manual handling as
well. So why do we need to sprinkle that muck all over the entry code?

From a semantical perspective VMENTER/VMEXIT are very similar to the return
to user / enter to user mechanics. Just that the transition happens in the
VMM code and not at the regular user/kernel transition points.

So why do you want ot treat that differently? There is absolutely zero
reason to do so. And there is no reason to create a pointlessly different
version of PTI which introduces yet another variant of a restricted page
table instead of just reusing and extending what's there already.

Thanks,

	tglx
Peter Zijlstra July 12, 2019, 11:44 a.m. UTC | #5
On Thu, Jul 11, 2019 at 04:25:12PM +0200, Alexandre Chartre wrote:
> Kernel Address Space Isolation aims to use address spaces to isolate some
> parts of the kernel (for example KVM) to prevent leaking sensitive data
> between hyper-threads under speculative execution attacks. You can refer
> to the first version of this RFC for more context:
> 
>    https://lkml.org/lkml/2019/5/13/515

No, no, no!

That is the crux of this entire series; you're not punting on explaining
exactly why we want to go dig through 26 patches of gunk.

You get to exactly explain what (your definition of) sensitive data is,
and which speculative scenarios and how this approach mitigates them.

And included in that is a high level overview of the whole thing.

On the one hand you've made this implementation for KVM, while on the
other hand you're saying it is generic but then fail to describe any
!KVM user.

AFAIK all speculative fails this is relevant to are now public, so
excruciating horrible details are fine and required.

AFAIK2 this is all because of MDS but it also helps with v1.

AFAIK3 this wants/needs to be combined with core-scheduling to be
useful, but not a single mention of that is anywhere.
Alexandre Chartre July 12, 2019, 11:56 a.m. UTC | #6
On 7/12/19 12:44 PM, Thomas Gleixner wrote:
> On Thu, 11 Jul 2019, Dave Hansen wrote:
> 
>> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
>>> - Kernel code mapped to the ASI page-table has been reduced to:
>>>    . the entire kernel (I still need to test with only the kernel text)
>>>    . the cpu entry area (because we need the GDT to be mapped)
>>>    . the cpu ASI session (for managing ASI)
>>>    . the current stack
>>>
>>> - Optionally, an ASI can request the following kernel mapping to be added:
>>>    . the stack canary
>>>    . the cpu offsets (this_cpu_off)
>>>    . the current task
>>>    . RCU data (rcu_data)
>>>    . CPU HW events (cpu_hw_events).
>>
>> I don't see the per-cpu areas in here.  But, the ASI macros in
>> entry_64.S (and asi_start_abort()) use per-cpu data.
>>
>> Also, this stuff seems to do naughty stuff (calling C code, touching
>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>> see anything excluding PTI and this code from coexisting.
> 
> That ASI thing is just PTI on steroids.
> 
> So why do we need two versions of the same thing? That's absolutely bonkers
> and will just introduce subtle bugs and conflicting decisions all over the
> place.
> 
> The need for ASI is very tightly coupled to the need for PTI and there is
> absolutely no point in keeping them separate.
>
> The only difference vs. interrupts and exceptions is that the PTI logic
> cares whether they enter from user or from kernel space while ASI only
> cares about the kernel entry.

I think that's precisely what makes ASI and PTI different and independent.
PTI is just about switching between userland and kernel page-tables, while
ASI is about switching page-table inside the kernel. You can have ASI without
having PTI. You can also use ASI for kernel threads so for code that won't
be triggered from userland and so which won't involve PTI.

> But most exceptions/interrupts transitions do not require to be handled at
> the entry code level because on VMEXIT the exit reason clearly tells
> whether a switch to the kernel CR3 is necessary or not. So this has to be
> handled at the VMM level already in a very clean and simple way.
> 
> I'm not a virt wizard, but according to code inspection and instrumentation
> even the NMI on the host is actually reinjected manually into the host via
> 'int $2' after the VMEXIT and for MCE it looks like manual handling as
> well. So why do we need to sprinkle that muck all over the entry code?
> 
>  From a semantical perspective VMENTER/VMEXIT are very similar to the return
> to user / enter to user mechanics. Just that the transition happens in the
> VMM code and not at the regular user/kernel transition points.

VMExit returns to the kernel, and ASI is used to run the VMExit handler with
a limited kernel address space instead of using the full kernel address space.
Change in entry code is required to handle any interrupt/exception which
can happen while running code with ASI (like KVM VMExit handler).

Note that KVM is an example of an ASI consumer, but ASI is generic and can be
used to run (mostly) any kernel code if you want to run code with a reduced
kernel address space.

> So why do you want ot treat that differently? There is absolutely zero
> reason to do so. And there is no reason to create a pointlessly different
> version of PTI which introduces yet another variant of a restricted page
> table instead of just reusing and extending what's there already.
> 

As I've tried to explain, to me PTI and ASI are different and independent.
PTI manages switching between userland and kernel page-table, and ASI manages
switching between kernel and a reduced-kernel page-table.


Thanks,

alex.
Alexandre Chartre July 12, 2019, 12:17 p.m. UTC | #7
On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> On Thu, Jul 11, 2019 at 04:25:12PM +0200, Alexandre Chartre wrote:
>> Kernel Address Space Isolation aims to use address spaces to isolate some
>> parts of the kernel (for example KVM) to prevent leaking sensitive data
>> between hyper-threads under speculative execution attacks. You can refer
>> to the first version of this RFC for more context:
>>
>>     https://lkml.org/lkml/2019/5/13/515
> 
> No, no, no!
> 
> That is the crux of this entire series; you're not punting on explaining
> exactly why we want to go dig through 26 patches of gunk.
> 
> You get to exactly explain what (your definition of) sensitive data is,
> and which speculative scenarios and how this approach mitigates them.
> 
> And included in that is a high level overview of the whole thing.
> 

Ok, I will rework the explanation. Sorry about that.

> On the one hand you've made this implementation for KVM, while on the
> other hand you're saying it is generic but then fail to describe any
> !KVM user.
> 
> AFAIK all speculative fails this is relevant to are now public, so
> excruciating horrible details are fine and required.

Ok.

> AFAIK2 this is all because of MDS but it also helps with v1.

Yes, mostly MDS and also L1TF.

> AFAIK3 this wants/needs to be combined with core-scheduling to be
> useful, but not a single mention of that is anywhere.

No. This is actually an alternative to core-scheduling. Eventually, ASI
will kick all sibling hyperthreads when exiting isolation and it needs to
run with the full kernel page-table (note that's currently not in these
patches).

So ASI can be seen as an optimization to disabling hyperthreading: instead
of just disabling hyperthreading you run with ASI, and when ASI can't preserve
isolation you will basically run with a single thread.

I will add all that to the explanation.

Thanks,

alex.
Peter Zijlstra July 12, 2019, 12:36 p.m. UTC | #8
On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
> On 7/12/19 1:44 PM, Peter Zijlstra wrote:

> > AFAIK3 this wants/needs to be combined with core-scheduling to be
> > useful, but not a single mention of that is anywhere.
> 
> No. This is actually an alternative to core-scheduling. Eventually, ASI
> will kick all sibling hyperthreads when exiting isolation and it needs to
> run with the full kernel page-table (note that's currently not in these
> patches).
> 
> So ASI can be seen as an optimization to disabling hyperthreading: instead
> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
> isolation you will basically run with a single thread.

You can't do that without much of the scheduler changes present in the
core-scheduling patches.
Alexandre Chartre July 12, 2019, 12:47 p.m. UTC | #9
On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> 
>>> AFAIK3 this wants/needs to be combined with core-scheduling to be
>>> useful, but not a single mention of that is anywhere.
>>
>> No. This is actually an alternative to core-scheduling. Eventually, ASI
>> will kick all sibling hyperthreads when exiting isolation and it needs to
>> run with the full kernel page-table (note that's currently not in these
>> patches).
>>
>> So ASI can be seen as an optimization to disabling hyperthreading: instead
>> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
>> isolation you will basically run with a single thread.
> 
> You can't do that without much of the scheduler changes present in the
> core-scheduling patches.
> 

We hope we can do that without the whole core-scheduling mechanism. The idea
is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
sibling hyperthreads and have them wait for a condition that will allow them
to resume execution (for example when re-entering isolation). We are
investigating this in parallel to ASI.

alex.
Peter Zijlstra July 12, 2019, 12:50 p.m. UTC | #10
On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:

> I think that's precisely what makes ASI and PTI different and independent.
> PTI is just about switching between userland and kernel page-tables, while
> ASI is about switching page-table inside the kernel. You can have ASI without
> having PTI. You can also use ASI for kernel threads so for code that won't
> be triggered from userland and so which won't involve PTI.

PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).

See how very similar they are?

Furthermore, to recover SMT for userspace (under MDS) we not only need
core-scheduling but core-scheduling per address space. And ASI was
specifically designed to help mitigate the trainwreck just described.

By explicitly exposing (hopefully harmless) part of the kernel to MDS,
we reduce the part that needs core-scheduling and thus reduce the rate
the SMT siblngs need to sync up/schedule.

But looking at it that way, it makes no sense to retain 3 address
spaces, namely:

  user / kernel exposed / kernel private.

Specifically, it makes no sense to expose part of the kernel through MDS
but not through Meltdow. Therefore we can merge the user and kernel
exposed address spaces.

And then we've fully replaced PTI.

So no, they're not orthogonal.
Peter Zijlstra July 12, 2019, 1:07 p.m. UTC | #11
On Fri, Jul 12, 2019 at 02:47:23PM +0200, Alexandre Chartre wrote:
> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> > On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
> > > On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> > 
> > > > AFAIK3 this wants/needs to be combined with core-scheduling to be
> > > > useful, but not a single mention of that is anywhere.
> > > 
> > > No. This is actually an alternative to core-scheduling. Eventually, ASI
> > > will kick all sibling hyperthreads when exiting isolation and it needs to
> > > run with the full kernel page-table (note that's currently not in these
> > > patches).
> > > 
> > > So ASI can be seen as an optimization to disabling hyperthreading: instead
> > > of just disabling hyperthreading you run with ASI, and when ASI can't preserve
> > > isolation you will basically run with a single thread.
> > 
> > You can't do that without much of the scheduler changes present in the
> > core-scheduling patches.
> > 
> 
> We hope we can do that without the whole core-scheduling mechanism. The idea
> is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
> sibling hyperthreads and have them wait for a condition that will allow them
> to resume execution (for example when re-entering isolation). We are
> investigating this in parallel to ASI.

You cannot wait from IPI context, so you have to go somewhere else to
wait.

Also, consider what happens when the task that entered isolation decides
to schedule out / gets migrated.

I think you'll quickly find yourself back at core-scheduling.
Alexandre Chartre July 12, 2019, 1:43 p.m. UTC | #12
On 7/12/19 2:50 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> 
>> I think that's precisely what makes ASI and PTI different and independent.
>> PTI is just about switching between userland and kernel page-tables, while
>> ASI is about switching page-table inside the kernel. You can have ASI without
>> having PTI. You can also use ASI for kernel threads so for code that won't
>> be triggered from userland and so which won't involve PTI.
> 
> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> 
> See how very similar they are?
>
> 
> Furthermore, to recover SMT for userspace (under MDS) we not only need
> core-scheduling but core-scheduling per address space. And ASI was
> specifically designed to help mitigate the trainwreck just described.
> 
> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> we reduce the part that needs core-scheduling and thus reduce the rate
> the SMT siblngs need to sync up/schedule.
> 
> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
> 
>    user / kernel exposed / kernel private.
> 
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdow. Therefore we can merge the user and kernel
> exposed address spaces.

The goal of ASI is to provide a reduced address space which exclude sensitive
data. A user process (for example a database daemon, a web server, or a vmm
like qemu) will likely have sensitive data mapped in its user address space.
Such data shouldn't be mapped with ASI because it can potentially leak to the
sibling hyperthread. For example, if an hyperthread is running a VM then the
VM could potentially access user sensitive data if they are mapped on the
sibling hyperthread with ASI.

The current approach is assuming that anything in the user address space
can be sensitive, and so the user address space shouldn't be mapped in ASI.

It looks like what you are suggesting could be an optimization when creating
an ASI for a process which has no sensitive data (this could be an option to
specify when creating an ASI, for example).

alex.

> 
> And then we've fully replaced PTI.
> 
> So no, they're not orthogonal.
>
Alexandre Chartre July 12, 2019, 1:46 p.m. UTC | #13
On 7/12/19 3:07 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 02:47:23PM +0200, Alexandre Chartre wrote:
>> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
>>> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
>>>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
>>>
>>>>> AFAIK3 this wants/needs to be combined with core-scheduling to be
>>>>> useful, but not a single mention of that is anywhere.
>>>>
>>>> No. This is actually an alternative to core-scheduling. Eventually, ASI
>>>> will kick all sibling hyperthreads when exiting isolation and it needs to
>>>> run with the full kernel page-table (note that's currently not in these
>>>> patches).
>>>>
>>>> So ASI can be seen as an optimization to disabling hyperthreading: instead
>>>> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
>>>> isolation you will basically run with a single thread.
>>>
>>> You can't do that without much of the scheduler changes present in the
>>> core-scheduling patches.
>>>
>>
>> We hope we can do that without the whole core-scheduling mechanism. The idea
>> is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
>> sibling hyperthreads and have them wait for a condition that will allow them
>> to resume execution (for example when re-entering isolation). We are
>> investigating this in parallel to ASI.
> 
> You cannot wait from IPI context, so you have to go somewhere else to
> wait.
> 
> Also, consider what happens when the task that entered isolation decides
> to schedule out / gets migrated.
> 
> I think you'll quickly find yourself back at core-scheduling.
> 

I haven't looked at details about what has been done so far. Hopefully, we
can do something not too complex, or reuse a (small) part of co-scheduling.

Thanks for pointing this out.

alex.
Dave Hansen July 12, 2019, 1:51 p.m. UTC | #14
On 7/12/19 1:09 AM, Alexandre Chartre wrote:
> On 7/12/19 12:38 AM, Dave Hansen wrote:
>> I don't see the per-cpu areas in here.  But, the ASI macros in
>> entry_64.S (and asi_start_abort()) use per-cpu data.
> 
> We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
> code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
> is created (see patch 15/26):

No fair!  I had per-cpu variables just for PTI at some point and had to
give them up! ;)

> +    /*
> +     * Map the percpu ASI sessions. This is used by interrupt handlers
> +     * to figure out if we have entered isolation and switch back to
> +     * the kernel address space.
> +     */
> +    err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
> +    if (err)
> +        return err;
> 
> 
>> Also, this stuff seems to do naughty stuff (calling C code, touching
>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>> see anything excluding PTI and this code from coexisting.
> 
> My understanding is that PTI CR3 writes only happens when switching to/from
> userland. While ASI enter/exit/abort happens while we are already in the
> kernel,
> so asi_start_abort() is not called when coming from userland and so not
> interacting with PTI.

OK, that makes sense.  You only need to call C code when interrupted
from something in the kernel (deeper than the entry code), and those
were already running kernel C code anyway.

If this continues to live in the entry code, I think you have a good
clue where to start commenting.

BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
from user vs. kernel.  It's tricky because there's a window both in the
entry and exit code where you are in the kernel but have a userspace CR3
value.  You end up needing a CR3 write when you have a userspace CR3
value when the interrupt occurred, not only when you interrupt userspace
itself.
Dave Hansen July 12, 2019, 1:54 p.m. UTC | #15
On 7/12/19 5:50 AM, Peter Zijlstra wrote:
> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> 
> See how very similar they are?

That's an interesting point.

I'd add that PTI maps a part of kernel space that partially overlaps
with what ASI wants.

> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
> 
>   user / kernel exposed / kernel private.
> 
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdown. Therefore we can merge the user and kernel
> exposed address spaces.
> 
> And then we've fully replaced PTI.

So, in one address space (PTI/user or ASI), we say, "screw it" and all
the data mapped is exposed to speculation attacks.  We have to be very
careful about what we map and expose here.

The other (full kernel) address space we are more careful about what we
*do* instead of what we map.  We map everything but have to add
mitigations to ensure that we don't leak anything back to the exposed
address space.

So, maybe we're not replacing PTI as much as we're growing PTI so that
we can run more kernel code with the (now inappropriately named) user
page tables.
Dave Hansen July 12, 2019, 1:58 p.m. UTC | #16
On 7/12/19 6:43 AM, Alexandre Chartre wrote:
> The current approach is assuming that anything in the user address space
> can be sensitive, and so the user address space shouldn't be mapped in ASI.

Is this universally true?

There's certainly *some* mitigation provided by SMAP that would allow
userspace to remain mapped and still protected.
Alexandre Chartre July 12, 2019, 2:06 p.m. UTC | #17
On 7/12/19 3:51 PM, Dave Hansen wrote:
> On 7/12/19 1:09 AM, Alexandre Chartre wrote:
>> On 7/12/19 12:38 AM, Dave Hansen wrote:
>>> I don't see the per-cpu areas in here.  But, the ASI macros in
>>> entry_64.S (and asi_start_abort()) use per-cpu data.
>>
>> We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
>> code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
>> is created (see patch 15/26):
> 
> No fair!  I had per-cpu variables just for PTI at some point and had to
> give them up! ;)
> 
>> +    /*
>> +     * Map the percpu ASI sessions. This is used by interrupt handlers
>> +     * to figure out if we have entered isolation and switch back to
>> +     * the kernel address space.
>> +     */
>> +    err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
>> +    if (err)
>> +        return err;
>>
>>
>>> Also, this stuff seems to do naughty stuff (calling C code, touching
>>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>>> see anything excluding PTI and this code from coexisting.
>>
>> My understanding is that PTI CR3 writes only happens when switching to/from
>> userland. While ASI enter/exit/abort happens while we are already in the
>> kernel,
>> so asi_start_abort() is not called when coming from userland and so not
>> interacting with PTI.
> 
> OK, that makes sense.  You only need to call C code when interrupted
> from something in the kernel (deeper than the entry code), and those
> were already running kernel C code anyway.
> 

Exactly.

> If this continues to live in the entry code, I think you have a good
> clue where to start commenting.

Yeah, lot of writing to do... :-)
  
> BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
> from user vs. kernel.  It's tricky because there's a window both in the
> entry and exit code where you are in the kernel but have a userspace CR3
> value.  You end up needing a CR3 write when you have a userspace CR3
> value when the interrupt occurred, not only when you interrupt userspace
> itself.
> 

Right. ASI is simpler because it comes from the kernel and return to the
kernel. There's just a small window (on entry) where we have the ASI CR3
but we quickly switch to the full kernel CR3.

alex.
Andy Lutomirski July 12, 2019, 2:36 p.m. UTC | #18
On Fri, Jul 12, 2019 at 6:45 AM Alexandre Chartre
<alexandre.chartre@oracle.com> wrote:
>
>
> On 7/12/19 2:50 PM, Peter Zijlstra wrote:
> > On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> >
> >> I think that's precisely what makes ASI and PTI different and independent.
> >> PTI is just about switching between userland and kernel page-tables, while
> >> ASI is about switching page-table inside the kernel. You can have ASI without
> >> having PTI. You can also use ASI for kernel threads so for code that won't
> >> be triggered from userland and so which won't involve PTI.
> >
> > PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> > ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> >
> > See how very similar they are?
> >
> >
> > Furthermore, to recover SMT for userspace (under MDS) we not only need
> > core-scheduling but core-scheduling per address space. And ASI was
> > specifically designed to help mitigate the trainwreck just described.
> >
> > By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> > we reduce the part that needs core-scheduling and thus reduce the rate
> > the SMT siblngs need to sync up/schedule.
> >
> > But looking at it that way, it makes no sense to retain 3 address
> > spaces, namely:
> >
> >    user / kernel exposed / kernel private.
> >
> > Specifically, it makes no sense to expose part of the kernel through MDS
> > but not through Meltdow. Therefore we can merge the user and kernel
> > exposed address spaces.
>
> The goal of ASI is to provide a reduced address space which exclude sensitive
> data. A user process (for example a database daemon, a web server, or a vmm
> like qemu) will likely have sensitive data mapped in its user address space.
> Such data shouldn't be mapped with ASI because it can potentially leak to the
> sibling hyperthread. For example, if an hyperthread is running a VM then the
> VM could potentially access user sensitive data if they are mapped on the
> sibling hyperthread with ASI.

So I've proposed the following slightly hackish thing:

Add a mechanism (call it /dev/xpfo).  When you open /dev/xpfo and
fallocate it to some size, you allocate that amount of memory and kick
it out of the kernel direct map.  (And pay the IPI cost unless there
were already cached non-direct-mapped pages ready.)  Then you map
*that* into your VMs.  Now, for a dedicated VM host, you map *all* the
VM private memory from /dev/xpfo.  Pretend it's SEV if you want to
determine which pages can be set up like this.

Does this get enough of the benefit at a negligible fraction of the
code complexity cost?  (This plus core scheduling, anyway.)

--Andy
Thomas Gleixner July 12, 2019, 3:16 p.m. UTC | #19
On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> 
> > I think that's precisely what makes ASI and PTI different and independent.
> > PTI is just about switching between userland and kernel page-tables, while
> > ASI is about switching page-table inside the kernel. You can have ASI without
> > having PTI. You can also use ASI for kernel threads so for code that won't
> > be triggered from userland and so which won't involve PTI.
> 
> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> 
> See how very similar they are?
> 
> Furthermore, to recover SMT for userspace (under MDS) we not only need
> core-scheduling but core-scheduling per address space. And ASI was
> specifically designed to help mitigate the trainwreck just described.
> 
> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> we reduce the part that needs core-scheduling and thus reduce the rate
> the SMT siblngs need to sync up/schedule.
> 
> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
> 
>   user / kernel exposed / kernel private.
> 
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdow. Therefore we can merge the user and kernel
> exposed address spaces.
> 
> And then we've fully replaced PTI.
> 
> So no, they're not orthogonal.

Right. If we decide to expose more parts of the kernel mappings then that's
just adding more stuff to the existing user (PTI) map mechanics.

As a consequence the CR3 switching points become different or can be
consolidated and that can be handled right at those switching points
depending on static keys or alternatives as we do today with PTI and other
mitigations.

All of that can do without that obscure "state machine" which is solely
there to duct-tape the complete lack of design. The same applies to that
mapping thing. Just mapping randomly selected parts by sticking them into
an array is a non-maintainable approach. This needs proper separation of
text and data sections, so violations of the mapping constraints can be
statically analyzed. Depending solely on the page fault at run time for
analysis is just bound to lead to hard to diagnose failures in the field.

TBH we all know already that this can be done and that this will solve some
of the issues caused by the speculation mess, so just writing some hastily
cobbled together POC code which explodes just by looking at it, does not
lead to anything else than time waste on all ends.

This first needs a clear definition of protection scope. That scope clearly
defines the required mappings and consequently the transition requirements
which provide the necessary transition points for flipping CR3.

If we have agreed on that, then we can think about the implementation
details.

Thanks,

	tglx
Peter Zijlstra July 12, 2019, 3:20 p.m. UTC | #20
On Fri, Jul 12, 2019 at 06:54:22AM -0700, Dave Hansen wrote:
> On 7/12/19 5:50 AM, Peter Zijlstra wrote:
> > PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> > ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> > 
> > See how very similar they are?
> 
> That's an interesting point.
> 
> I'd add that PTI maps a part of kernel space that partially overlaps
> with what ASI wants.

Right, wherever we put the boundary, we need whatever is required to
cross it.

> > But looking at it that way, it makes no sense to retain 3 address
> > spaces, namely:
> > 
> >   user / kernel exposed / kernel private.
> > 
> > Specifically, it makes no sense to expose part of the kernel through MDS
> > but not through Meltdown. Therefore we can merge the user and kernel
> > exposed address spaces.
> > 
> > And then we've fully replaced PTI.
> 
> So, in one address space (PTI/user or ASI), we say, "screw it" and all
> the data mapped is exposed to speculation attacks.  We have to be very
> careful about what we map and expose here.

Yes, which is why, in an earlier email, I've asked for a clear
definition of 'sensitive" :-)

> So, maybe we're not replacing PTI as much as we're growing PTI so that
> we can run more kernel code with the (now inappropriately named) user
> page tables.

Right.
Thomas Gleixner July 12, 2019, 3:23 p.m. UTC | #21
On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 3:51 PM, Dave Hansen wrote:
> > BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
> > from user vs. kernel.  It's tricky because there's a window both in the
> > entry and exit code where you are in the kernel but have a userspace CR3
> > value.  You end up needing a CR3 write when you have a userspace CR3
> > value when the interrupt occurred, not only when you interrupt userspace
> > itself.
> > 
> 
> Right. ASI is simpler because it comes from the kernel and return to the
> kernel. There's just a small window (on entry) where we have the ASI CR3
> but we quickly switch to the full kernel CR3.

That's wrong in several aspects.

   1) You are looking at it purely from the VMM perspective, which is bogus
      as you already said, that this can/should be used to be extended to
      other scenarios (including kvm ioctl or such).

      So no, it's not just coming from kernel space and returning to it.

      If that'd be true then the entry code could just stay as is because
      you can handle _ALL_ of that very trivial in the atomic VMM
      enter/exit code.

   2) It does not matter how small that window is. If there is a window
      then this needs to be covered, no matter what.

Thanks,

	tglx
Thomas Gleixner July 12, 2019, 4 p.m. UTC | #22
On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 12:44 PM, Thomas Gleixner wrote:
> > That ASI thing is just PTI on steroids.
> > 
> > So why do we need two versions of the same thing? That's absolutely bonkers
> > and will just introduce subtle bugs and conflicting decisions all over the
> > place.
> > 
> > The need for ASI is very tightly coupled to the need for PTI and there is
> > absolutely no point in keeping them separate.
> > 
> > The only difference vs. interrupts and exceptions is that the PTI logic
> > cares whether they enter from user or from kernel space while ASI only
> > cares about the kernel entry.
> 
> I think that's precisely what makes ASI and PTI different and independent.
> PTI is just about switching between userland and kernel page-tables, while
> ASI is about switching page-table inside the kernel. You can have ASI without
> having PTI. You can also use ASI for kernel threads so for code that won't
> be triggered from userland and so which won't involve PTI.

It's still the same concept. And you can argue in circles it does not
justify yet another mapping setup with is a different copy of some other
mapping setup. Whether PTI is replaced by ASI or PTI is extended to handle
ASI does not matter at all. Having two similar concepts side by side is a
guarantee for disaster.

> > So why do you want ot treat that differently? There is absolutely zero
> > reason to do so. And there is no reason to create a pointlessly different
> > version of PTI which introduces yet another variant of a restricted page
> > table instead of just reusing and extending what's there already.
> > 
> 
> As I've tried to explain, to me PTI and ASI are different and independent.
> PTI manages switching between userland and kernel page-table, and ASI manages
> switching between kernel and a reduced-kernel page-table.

Again. It's the same concept and it does not matter what form of reduced
page tables you use. You always need transition points and in order to make
the transition points work you need reliably mapped bits and pieces.

Also Paul wants to use the same concept for user space so trivial system
calls can do w/o PTI. In some other thread you said yourself that this
could be extended to cover the kvm ioctl, which is clearly a return to user
space.

Are we then going to add another set of randomly sprinkled transition
points and yet another 'state machine' to duct-tape the fallout?

Definitely not going to happen.

Thanks,

	tglx
Alexandre Chartre July 12, 2019, 4:37 p.m. UTC | #23
On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>
>>> I think that's precisely what makes ASI and PTI different and independent.
>>> PTI is just about switching between userland and kernel page-tables, while
>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>> be triggered from userland and so which won't involve PTI.
>>
>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>
>> See how very similar they are?
>>
>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>> core-scheduling but core-scheduling per address space. And ASI was
>> specifically designed to help mitigate the trainwreck just described.
>>
>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>> we reduce the part that needs core-scheduling and thus reduce the rate
>> the SMT siblngs need to sync up/schedule.
>>
>> But looking at it that way, it makes no sense to retain 3 address
>> spaces, namely:
>>
>>    user / kernel exposed / kernel private.
>>
>> Specifically, it makes no sense to expose part of the kernel through MDS
>> but not through Meltdow. Therefore we can merge the user and kernel
>> exposed address spaces.
>>
>> And then we've fully replaced PTI.
>>
>> So no, they're not orthogonal.
> 
> Right. If we decide to expose more parts of the kernel mappings then that's
> just adding more stuff to the existing user (PTI) map mechanics.
  

If we expose more parts of the kernel mapping by adding them to the existing
user (PTI) map, then we only control the mapping of kernel sensitive data but
we don't control user mapping (with ASI, we exclude all user mappings).

How would you control the mapping of userland sensitive data and exclude them
from the user map? Would you have the application explicitly identify sensitive
data (like Andy suggested with a /dev/xpfo device)?

Thanks,

alex.


> As a consequence the CR3 switching points become different or can be
> consolidated and that can be handled right at those switching points
> depending on static keys or alternatives as we do today with PTI and other
> mitigations.
> 
> All of that can do without that obscure "state machine" which is solely
> there to duct-tape the complete lack of design. The same applies to that
> mapping thing. Just mapping randomly selected parts by sticking them into
> an array is a non-maintainable approach. This needs proper separation of
> text and data sections, so violations of the mapping constraints can be
> statically analyzed. Depending solely on the page fault at run time for
> analysis is just bound to lead to hard to diagnose failures in the field.
> 
> TBH we all know already that this can be done and that this will solve some
> of the issues caused by the speculation mess, so just writing some hastily
> cobbled together POC code which explodes just by looking at it, does not
> lead to anything else than time waste on all ends.
> 
> This first needs a clear definition of protection scope. That scope clearly
> defines the required mappings and consequently the transition requirements
> which provide the necessary transition points for flipping CR3.
> 
> If we have agreed on that, then we can think about the implementation
> details.
> 
> Thanks,
> 
> 	tglx
>
Andy Lutomirski July 12, 2019, 4:45 p.m. UTC | #24
> On Jul 12, 2019, at 10:37 AM, Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
> 
> 
> 
>> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>> 
>>>> I think that's precisely what makes ASI and PTI different and independent.
>>>> PTI is just about switching between userland and kernel page-tables, while
>>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>>> be triggered from userland and so which won't involve PTI.
>>> 
>>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
>>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>> 
>>> See how very similar they are?
>>> 
>>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>>> core-scheduling but core-scheduling per address space. And ASI was
>>> specifically designed to help mitigate the trainwreck just described.
>>> 
>>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>>> we reduce the part that needs core-scheduling and thus reduce the rate
>>> the SMT siblngs need to sync up/schedule.
>>> 
>>> But looking at it that way, it makes no sense to retain 3 address
>>> spaces, namely:
>>> 
>>>   user / kernel exposed / kernel private.
>>> 
>>> Specifically, it makes no sense to expose part of the kernel through MDS
>>> but not through Meltdow. Therefore we can merge the user and kernel
>>> exposed address spaces.
>>> 
>>> And then we've fully replaced PTI.
>>> 
>>> So no, they're not orthogonal.
>> Right. If we decide to expose more parts of the kernel mappings then that's
>> just adding more stuff to the existing user (PTI) map mechanics.
> 
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).
> 
> How would you control the mapping of userland sensitive data and exclude them
> from the user map?

As I see it, if we think part of the kernel is okay to leak to VM guests, then it should think it’s okay to leak to userspace and versa. At the end of the day, this may just have to come down to an administrator’s choice of how careful the mitigations need to be.

> Would you have the application explicitly identify sensitive
> data (like Andy suggested with a /dev/xpfo device)?

That’s not really the intent of my suggestion. I was suggesting that maybe we don’t need ASI at all if we allow VMs to exclude their memory from the kernel mapping entirely.  Heck, in a setup like this, we can maybe even get away with turning PTI off under very, very controlled circumstances.  I’m not quite sure what to do about the kernel random pools, though.
Peter Zijlstra July 12, 2019, 7:06 p.m. UTC | #25
On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> On 7/12/19 5:16 PM, Thomas Gleixner wrote:

> > Right. If we decide to expose more parts of the kernel mappings then that's
> > just adding more stuff to the existing user (PTI) map mechanics.
> 
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).
> 
> How would you control the mapping of userland sensitive data and exclude them
> from the user map? Would you have the application explicitly identify sensitive
> data (like Andy suggested with a /dev/xpfo device)?

To what purpose do you want to exclude userspace from the kernel
mapping; that is, what are you mitigating against with that?
Thomas Gleixner July 12, 2019, 7:48 p.m. UTC | #26
On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> > On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> > > On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> > > And then we've fully replaced PTI.
> > > 
> > > So no, they're not orthogonal.
> > 
> > Right. If we decide to expose more parts of the kernel mappings then that's
> > just adding more stuff to the existing user (PTI) map mechanics.
>  
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).

What prevents you from adding functionality to do so to the PTI
implementation? Nothing.

Again, the underlying concept is exactly the same:

  1) Create a restricted mapping from an existing mapping

  2) Switch to the restricted mapping when entering a particular execution
     context

  3) Switch to the unrestricted mapping when leaving that execution context

  4) Keep track of the state

The restriction scope is different, but that's conceptually completely
irrelevant. It's a detail which needs to be handled at the implementation
level.

What matters here is the concept and because the concept is the same, this
needs to share the infrastructure for #1 - #4.

It's obvious that this requires changes to the way PTI works today, but
anything which creates a parallel implementation of any part of the above
#1 - #4 is not going anywhere.

This stuff is way too sensitive and has pretty well understood limitations
and corner cases. So it needs to be designed from ground up to handle these
proper. Which also means, that the possible use cases are going to be
limited.

As I said before, come up with a list of possible usage scenarios and
protection scopes first and please take all the ideas other people have
with this into account. This includes PTI of course.

Once we have that we need to figure out whether these things can actually
coexist and do not contradict each other at the semantical level and
whether the outcome justifies the resulting complexity.

After that we can talk about implementation details.

This problem is not going to be solved with handwaving and an ad hoc
implementation which creates more problems than it solves.

Thanks,

	tglx
Andy Lutomirski July 14, 2019, 3:06 p.m. UTC | #27
On Fri, Jul 12, 2019 at 12:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> > On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>
> > > Right. If we decide to expose more parts of the kernel mappings then that's
> > > just adding more stuff to the existing user (PTI) map mechanics.
> >
> > If we expose more parts of the kernel mapping by adding them to the existing
> > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > we don't control user mapping (with ASI, we exclude all user mappings).
> >
> > How would you control the mapping of userland sensitive data and exclude them
> > from the user map? Would you have the application explicitly identify sensitive
> > data (like Andy suggested with a /dev/xpfo device)?
>
> To what purpose do you want to exclude userspace from the kernel
> mapping; that is, what are you mitigating against with that?

Mutually distrusting user/guest tenants.  Imagine an attack against a
VM hosting provider (GCE, for example).  If the overall system is
well-designed, the host kernel won't possess secrets that are
important to the overall hosting network.  The interesting secrets are
in the memory of other tenants running under the same host.  So, if we
can mostly or completely avoid mapping one tenant's memory in the
host, we reduce the amount of valuable information that could leak via
a speculation (or wild read) attack to another tenant.

The practicality of such a scheme is obviously an open question.
Mike Rapoport July 14, 2019, 5:11 p.m. UTC | #28
On Fri, Jul 12, 2019 at 10:45:06AM -0600, Andy Lutomirski wrote:
> 
> 
> > On Jul 12, 2019, at 10:37 AM, Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
> > 
> > 
> > 
> >> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> >>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> >>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> >>>> 
> >>>> I think that's precisely what makes ASI and PTI different and independent.
> >>>> PTI is just about switching between userland and kernel page-tables, while
> >>>> ASI is about switching page-table inside the kernel. You can have ASI without
> >>>> having PTI. You can also use ASI for kernel threads so for code that won't
> >>>> be triggered from userland and so which won't involve PTI.
> >>> 
> >>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
> >>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> >>> 
> >>> See how very similar they are?
> >>> 
> >>> Furthermore, to recover SMT for userspace (under MDS) we not only need
> >>> core-scheduling but core-scheduling per address space. And ASI was
> >>> specifically designed to help mitigate the trainwreck just described.
> >>> 
> >>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> >>> we reduce the part that needs core-scheduling and thus reduce the rate
> >>> the SMT siblngs need to sync up/schedule.
> >>> 
> >>> But looking at it that way, it makes no sense to retain 3 address
> >>> spaces, namely:
> >>> 
> >>>   user / kernel exposed / kernel private.
> >>> 
> >>> Specifically, it makes no sense to expose part of the kernel through MDS
> >>> but not through Meltdow. Therefore we can merge the user and kernel
> >>> exposed address spaces.
> >>> 
> >>> And then we've fully replaced PTI.
> >>> 
> >>> So no, they're not orthogonal.
> >> Right. If we decide to expose more parts of the kernel mappings then that's
> >> just adding more stuff to the existing user (PTI) map mechanics.
> > 
> > If we expose more parts of the kernel mapping by adding them to the existing
> > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > we don't control user mapping (with ASI, we exclude all user mappings).
> > 
> > How would you control the mapping of userland sensitive data and exclude them
> > from the user map?
> 
> As I see it, if we think part of the kernel is okay to leak to VM guests,
> then it should think it’s okay to leak to userspace and versa. At the end
> of the day, this may just have to come down to an administrator’s choice
> of how careful the mitigations need to be.
> 
> > Would you have the application explicitly identify sensitive
> > data (like Andy suggested with a /dev/xpfo device)?
> 
> That’s not really the intent of my suggestion. I was suggesting that
> maybe we don’t need ASI at all if we allow VMs to exclude their memory
> from the kernel mapping entirely.  Heck, in a setup like this, we can
> maybe even get away with turning PTI off under very, very controlled
> circumstances.  I’m not quite sure what to do about the kernel random
> pools, though.

I think KVM already allows excluding VMs memory from the kernel mapping
with the "new guest mapping interface" [1]. The memory managed by the host
can be restricted with "mem=" and KVM maps/unmaps the guest memory pages
only when needed.

It would be interesting to see if /dev/xpfo or even
madvise(MAKE_MY_MEMORY_PRIVATE) can be made useful for multi-tenant
container hosts.

[1] https://lore.kernel.org/lkml/1548966284-28642-1-git-send-email-karahmed@amazon.de/
Alexander Graf July 14, 2019, 6:17 p.m. UTC | #29
On 12.07.19 16:36, Andy Lutomirski wrote:
> On Fri, Jul 12, 2019 at 6:45 AM Alexandre Chartre
> <alexandre.chartre@oracle.com> wrote:
>>
>>
>> On 7/12/19 2:50 PM, Peter Zijlstra wrote:
>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>
>>>> I think that's precisely what makes ASI and PTI different and independent.
>>>> PTI is just about switching between userland and kernel page-tables, while
>>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>>> be triggered from userland and so which won't involve PTI.
>>>
>>> PTI is not mapping         kernel space to avoid             speculation crap (meltdown).
>>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>>
>>> See how very similar they are?
>>>
>>>
>>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>>> core-scheduling but core-scheduling per address space. And ASI was
>>> specifically designed to help mitigate the trainwreck just described.
>>>
>>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>>> we reduce the part that needs core-scheduling and thus reduce the rate
>>> the SMT siblngs need to sync up/schedule.
>>>
>>> But looking at it that way, it makes no sense to retain 3 address
>>> spaces, namely:
>>>
>>>     user / kernel exposed / kernel private.
>>>
>>> Specifically, it makes no sense to expose part of the kernel through MDS
>>> but not through Meltdow. Therefore we can merge the user and kernel
>>> exposed address spaces.
>>
>> The goal of ASI is to provide a reduced address space which exclude sensitive
>> data. A user process (for example a database daemon, a web server, or a vmm
>> like qemu) will likely have sensitive data mapped in its user address space.
>> Such data shouldn't be mapped with ASI because it can potentially leak to the
>> sibling hyperthread. For example, if an hyperthread is running a VM then the
>> VM could potentially access user sensitive data if they are mapped on the
>> sibling hyperthread with ASI.
> 
> So I've proposed the following slightly hackish thing:
> 
> Add a mechanism (call it /dev/xpfo).  When you open /dev/xpfo and
> fallocate it to some size, you allocate that amount of memory and kick
> it out of the kernel direct map.  (And pay the IPI cost unless there
> were already cached non-direct-mapped pages ready.)  Then you map
> *that* into your VMs.  Now, for a dedicated VM host, you map *all* the
> VM private memory from /dev/xpfo.  Pretend it's SEV if you want to
> determine which pages can be set up like this.
> 
> Does this get enough of the benefit at a negligible fraction of the
> code complexity cost?  (This plus core scheduling, anyway.)

The problem with that approach is that you lose the ability to run 
legacy workloads that do not support an SEV like model of "guest owned" 
and "host visible" pages, but instead assume you can DMA anywhere.

Without that, your host will have visibility into guest pages via user 
space (QEMU) pages which again are mapped in the kernel direct map, so 
can be exposed via a spectre gadget into a malicious guest.

Also, please keep in mind that even register state of other VMs may be a 
secret that we do not want to leak into other guests.


Alex
Alexandre Chartre July 15, 2019, 8:23 a.m. UTC | #30
On 7/12/19 9:48 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Alexandre Chartre wrote:
>> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>> And then we've fully replaced PTI.
>>>>
>>>> So no, they're not orthogonal.
>>>
>>> Right. If we decide to expose more parts of the kernel mappings then that's
>>> just adding more stuff to the existing user (PTI) map mechanics.
>>   
>> If we expose more parts of the kernel mapping by adding them to the existing
>> user (PTI) map, then we only control the mapping of kernel sensitive data but
>> we don't control user mapping (with ASI, we exclude all user mappings).
> 
> What prevents you from adding functionality to do so to the PTI
> implementation? Nothing.
> 
> Again, the underlying concept is exactly the same:
> 
>    1) Create a restricted mapping from an existing mapping
> 
>    2) Switch to the restricted mapping when entering a particular execution
>       context
> 
>    3) Switch to the unrestricted mapping when leaving that execution context
> 
>    4) Keep track of the state
> 
> The restriction scope is different, but that's conceptually completely
> irrelevant. It's a detail which needs to be handled at the implementation
> level.
> 
> What matters here is the concept and because the concept is the same, this
> needs to share the infrastructure for #1 - #4.
> 

You are totally right, that's the same concept (page-table creation and switching),
it is just used in different contexts. Sorry it took me that long to realize it,
I was too focus on the use case.


> It's obvious that this requires changes to the way PTI works today, but
> anything which creates a parallel implementation of any part of the above
> #1 - #4 is not going anywhere.
> 
> This stuff is way too sensitive and has pretty well understood limitations
> and corner cases. So it needs to be designed from ground up to handle these
> proper. Which also means, that the possible use cases are going to be
> limited.
>
> As I said before, come up with a list of possible usage scenarios and
> protection scopes first and please take all the ideas other people have
> with this into account. This includes PTI of course.
> 
> Once we have that we need to figure out whether these things can actually
> coexist and do not contradict each other at the semantical level and
> whether the outcome justifies the resulting complexity.
> 
> After that we can talk about implementation details.

Right, that makes perfect sense. I think so far we have the following scenarios:

  - PTI
  - KVM (i.e. VMExit handler isolation)
  - maybe some syscall isolation?

I will look at them in more details, in particular what particular mappings they
need and when they need to switch mappings.


And thanks for putting me back on the right track.


alex.

> This problem is not going to be solved with handwaving and an ad hoc
> implementation which creates more problems than it solves.
> 
> Thanks,
> 
> 	tglx
>
Thomas Gleixner July 15, 2019, 8:28 a.m. UTC | #31
Alexandre,

On Mon, 15 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 9:48 PM, Thomas Gleixner wrote:
> > As I said before, come up with a list of possible usage scenarios and
> > protection scopes first and please take all the ideas other people have
> > with this into account. This includes PTI of course.
> > 
> > Once we have that we need to figure out whether these things can actually
> > coexist and do not contradict each other at the semantical level and
> > whether the outcome justifies the resulting complexity.
> > 
> > After that we can talk about implementation details.
> 
> Right, that makes perfect sense. I think so far we have the following
> scenarios:
> 
>  - PTI
>  - KVM (i.e. VMExit handler isolation)
>  - maybe some syscall isolation?

Vs. the latter you want to talk to Paul Turner. He had some ideas there.

> I will look at them in more details, in particular what particular
> mappings they need and when they need to switch mappings.
> 
> And thanks for putting me back on the right track.

That's what maintainers are for :)

Thanks,

	tglx
Peter Zijlstra July 15, 2019, 10:33 a.m. UTC | #32
On Sun, Jul 14, 2019 at 08:06:12AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 12, 2019 at 12:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> > > On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> >
> > > > Right. If we decide to expose more parts of the kernel mappings then that's
> > > > just adding more stuff to the existing user (PTI) map mechanics.
> > >
> > > If we expose more parts of the kernel mapping by adding them to the existing
> > > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > > we don't control user mapping (with ASI, we exclude all user mappings).
> > >
> > > How would you control the mapping of userland sensitive data and exclude them
> > > from the user map? Would you have the application explicitly identify sensitive
> > > data (like Andy suggested with a /dev/xpfo device)?
> >
> > To what purpose do you want to exclude userspace from the kernel
> > mapping; that is, what are you mitigating against with that?
> 
> Mutually distrusting user/guest tenants.  Imagine an attack against a
> VM hosting provider (GCE, for example).  If the overall system is
> well-designed, the host kernel won't possess secrets that are
> important to the overall hosting network.  The interesting secrets are
> in the memory of other tenants running under the same host.  So, if we
> can mostly or completely avoid mapping one tenant's memory in the
> host, we reduce the amount of valuable information that could leak via
> a speculation (or wild read) attack to another tenant.
> 
> The practicality of such a scheme is obviously an open question.

Ah, ok. So it's some virt specific nonsense. I'll go on ignoring it then
;-)
Dario Faggioli July 31, 2019, 4:31 p.m. UTC | #33
Hello all,

I know this is a bit of an old thread, so apologies for being late to
the party. :-)

I would have a question about this:

> > > On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> > > > On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre
> > > > wrote:
> > > > > On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> > > > > > AFAIK3 this wants/needs to be combined with core-scheduling 
> > > > > > to be
> > > > > > useful, but not a single mention of that is anywhere.
> > > > > 
> > > > > No. This is actually an alternative to core-scheduling.
> > > > > Eventually, ASI
> > > > > will kick all sibling hyperthreads when exiting isolation and
> > > > > it needs to
> > > > > run with the full kernel page-table (note that's currently
> > > > > not in these
> > > > > patches).
> 
I.e., about the fact that ASI is presented as an alternative to
core-scheduling or, at least, as it will only need integrate a small
subset of the logic (and of the code) from core-scheduling, as said
here:

> I haven't looked at details about what has been done so far.
> Hopefully, we
> can do something not too complex, or reuse a (small) part of co-
> scheduling.
> 
Now, sticking to virtualization examples, if you don't have core-
scheduling, it means that you can have two vcpus, one from VM A and the
other from VM B, running on the same core, one on thread 0 and the
other one on thread 1, at the same time.

And if VM A's vcpu, running on thread 0, exits, then VM B's vcpu
running in guest more on thread 1 can read host memory, as it is
speculatively accessed (either "normally" or because of cache load
gadgets) and brought in L1D cache by thread 0. And Indeed I do see how
ASI protects us from this attack scenario.

However, when the two VMs' vcpus are both running in guest mode, each
one on a thread of the same core, VM B's vcpu running on thread 1 can
exploit L1TF to peek at and steal secrets that VM A's vcpu, running on
thread 0, is accessing, as they're brought into L1D cache... can't it? 

How can, ASI *without* core-scheduling, prevent this other attack
scenario?

Because I may very well be missing something, but it looks to me that
it can't. In which case, I'm not sure we can call it "alternative" to
core-scheduling.... Or is the second attack scenario that I tried to
describe above, not considered interesting?

Thanks and Regards
Alexandre Chartre Aug. 22, 2019, 12:31 p.m. UTC | #34
On 7/31/19 6:31 PM, Dario Faggioli wrote:
> Hello all,
> 
> I know this is a bit of an old thread, so apologies for being late to
> the party. :-)

And sorry for the late reply, I was away for a while.

> I would have a question about this:
> 
>>>> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
>>>>> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre
>>>>> wrote:
>>>>>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
>>>>>>> AFAIK3 this wants/needs to be combined with core-scheduling
>>>>>>> to be
>>>>>>> useful, but not a single mention of that is anywhere.
>>>>>>
>>>>>> No. This is actually an alternative to core-scheduling.
>>>>>> Eventually, ASI
>>>>>> will kick all sibling hyperthreads when exiting isolation and
>>>>>> it needs to
>>>>>> run with the full kernel page-table (note that's currently
>>>>>> not in these
>>>>>> patches).
>>
> I.e., about the fact that ASI is presented as an alternative to
> core-scheduling or, at least, as it will only need integrate a small
> subset of the logic (and of the code) from core-scheduling, as said
> here:
> 
>> I haven't looked at details about what has been done so far.
>> Hopefully, we
>> can do something not too complex, or reuse a (small) part of co-
>> scheduling.
>>
> Now, sticking to virtualization examples, if you don't have core-
> scheduling, it means that you can have two vcpus, one from VM A and the
> other from VM B, running on the same core, one on thread 0 and the
> other one on thread 1, at the same time.
> 
> And if VM A's vcpu, running on thread 0, exits, then VM B's vcpu
> running in guest more on thread 1 can read host memory, as it is
> speculatively accessed (either "normally" or because of cache load
> gadgets) and brought in L1D cache by thread 0. And Indeed I do see how
> ASI protects us from this attack scenario.
> 
>
> However, when the two VMs' vcpus are both running in guest mode, each
> one on a thread of the same core, VM B's vcpu running on thread 1 can
> exploit L1TF to peek at and steal secrets that VM A's vcpu, running on
> thread 0, is accessing, as they're brought into L1D cache... can't it?
> 
> How can, ASI *without* core-scheduling, prevent this other attack
> scenario?
>
> Because I may very well be missing something, but it looks to me that
> it can't. In which case, I'm not sure we can call it "alternative" to
> core-scheduling.... Or is the second attack scenario that I tried to
> describe above, not considered interesting?
> 

Correct, ASI doesn't prevent this attack scenario. However, this case can
be prevented by pinning each VM to different CPU cores (for example, using
cgroups) so that you never have two different VMs running with CPU threads
from the same CPU core. Of course, this limits the number of VMs you can
run to the number of CPU cores on the system but we assume this is a
reasonable configuration when you want to have high performing VM.

Rgds,

alex.