[RFC,00/47] Address Space Isolation for KVM

Message ID 20220223052223.1202152-1-junaids@google.com (mailing list archive)

Junaid Shahid Feb. 23, 2022, 5:21 a.m. UTC
This patch series is a proof-of-concept RFC for an end-to-end implementation of 
Address Space Isolation for KVM. It has similar goals and a somewhat similar 
high-level design to the original ASI patches from Alexandre Chartre 
([1],[2],[3],[4]), but with a different underlying implementation. It also 
includes several memory management changes to help differentiate between 
sensitive and non-sensitive memory and to map non-sensitive memory into the 
ASI restricted address spaces.

This RFC is intended as a demonstration of what a full ASI implementation for 
KVM could look like, not necessarily as a direct proposal for what might 
eventually be merged. In particular, these patches do not yet implement KPTI on 
top of ASI, although the framework is generic enough to be able to support it. 
Similarly, these patches do not include non-sensitive annotations for data 
structures that did not get frequently accessed during execution of our test 
workloads, but the framework is designed such that new non-sensitive memory 
annotations can be added trivially.

The patches apply on top of Linux v5.16. These patches are also available via 
gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.

Background
==========
Address Space Isolation is a comprehensive security mitigation for several types 
of speculative execution attacks. Even though the kernel already has several 
speculative execution vulnerability mitigations, some of them can be quite 
expensive if enabled fully. For example, fully mitigating L1TF using the 
existing mechanisms requires doing an L1 cache flush on every single VM entry 
as well as disabling hyperthreading altogether. (Although core scheduling can 
provide some protection when hyperthreading is enabled, it is not sufficient by 
itself to protect against all leaks unless sibling hyperthread stunning is also 
performed on every VM exit.) ASI provides a much less expensive mitigation for 
such vulnerabilities while still providing a nearly equivalent level of 
protection.

There are a couple of basic insights/assumptions behind ASI:

1. Most execution paths in the kernel (especially during virtual machine 
execution) access only memory that is not particularly sensitive even if it were 
to get leaked to the executing process/VM (setting aside for a moment what 
exactly should be considered sensitive or non-sensitive).
2. Even when executing speculatively, the CPU can generally only bring memory 
that is mapped in the current page tables into its various caches and internal 
buffers.

Given these, the idea of using ASI to thwart speculative attacks is that we can 
execute the kernel using a restricted set of page tables most of the time, 
switching to the full unrestricted kernel address space only when the kernel 
needs to access something that is not mapped in the restricted address space. 
We keep track of when a switch to the full kernel address space is done, so 
that before returning back to the process/VM, we can switch back to the 
restricted address space. In the paths where the kernel is able to execute 
entirely within the restricted address space, we can skip other mitigations for 
speculative execution attacks (such as L1 cache / micro-architectural buffer 
flushes, sibling hyperthread stunning etc.). Only in the cases where we do end 
up switching the page tables do we perform these more expensive mitigations. 
Assuming that happens relatively infrequently, the performance can be 
significantly better compared to performing these mitigations all the time.

Please note that although we do have a sibling hyperthread stunning 
implementation internally, which is fully integrated with KVM-ASI, it is not 
included in this RFC for the time being. The earlier upstream proposal for 
sibling stunning [6] could potentially be integrated into an upstream ASI 
implementation.

Basic concepts
==============
Different types of restricted address spaces are represented by different ASI 
classes. For instance, KVM-ASI is an ASI class used during VM execution; KPTI 
would be another ASI class. An ASI instance (struct asi) represents a single 
restricted address space. There is a separate ASI instance for each untrusted 
context (e.g. a userspace process, a VM, or even a single VCPU). Note that 
there can be multiple untrusted security contexts (and thus multiple restricted 
address spaces) within a single process. For example, in the case of VMs, the 
userspace process is a different security context than the guest VM, and in 
principle, even each VCPU could be considered a separate security context (that 
would be primarily useful for securing nested virtualization).

In this RFC, a process can have at most one ASI instance of each class, though 
this is not an inherent limitation and multiple instances of the same class 
should eventually be supported. (A process can still have ASI instances of 
different classes e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even 
entirely necessary to tie an ASI instance to a process. That is just a 
simplification for the initial implementation.

An asi_enter operation switches into the restricted address space represented by 
the given ASI instance. An asi_exit operation switches to the full unrestricted 
kernel address space. Each ASI class can provide hooks to be executed during 
these operations, which can be used to perform speculative attack mitigations 
relevant to that class. For instance, the KVM-ASI hooks would perform a 
sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear 
and sibling-hyperthread-unstun operations in the asi_enter hook. On the other 
hand, the hooks for the KPTI class would be no-ops, since switching the page 
tables is sufficient mitigation in that case.

If the kernel attempts to access memory that is not mapped in the currently 
active ASI instance, the page fault handler automatically performs an asi_exit 
operation. This means that, except for a few critical pieces of memory, leaving 
something out of a restricted address space will result in only a performance 
hit, rather than a catastrophic failure. The kernel can also perform explicit 
asi_exit operations in some paths as needed.

Apart from the page fault handler, other exceptions and interrupts (even NMIs) 
do not automatically cause an asi_exit and could potentially be executed 
completely within a restricted address space if they don't end up accessing any 
sensitive piece of memory.

The mappings within a restricted address space are always a subset of the full 
kernel address space and each mapping is always the same as the corresponding 
mapping in the full kernel address space. This is necessary because we could 
potentially end up performing an asi_exit at any point.

Although this RFC only includes an implementation of the KVM-ASI class, a KPTI 
class could also be implemented on top of the same infrastructure. Furthermore, 
in the future we could also implement a KPTI-Next class that actually uses the 
ASI model for userspace processes i.e. mapping non-sensitive kernel memory in 
the restricted address space and trying to execute most syscalls/interrupts 
without switching to the full kernel address space, as opposed to the current 
KPTI, which requires an address space switch on every kernel/user mode 
transition.

Memory classification
=====================
We divide memory into three categories.

1. Sensitive memory
This is memory that should never get leaked to any process or VM. Sensitive 
memory is only mapped in the unrestricted kernel page tables. By default, all 
memory is considered sensitive unless specifically categorized otherwise.

2. Globally non-sensitive memory
This is memory that does not present a substantial security threat even if it 
were to get leaked to any process or VM in the system. Globally non-sensitive 
memory is mapped in the restricted address spaces for all processes.

3. Locally non-sensitive memory
This is memory that does not present a substantial security threat if it were to 
get leaked to the currently running process or VM, but would present a security 
issue if it were to get leaked to any other process or VM in the system. 
Examples include userspace memory (or guest memory in the case of VMs) or kernel 
structures containing userspace/guest register context etc. Locally 
non-sensitive memory is mapped only in the restricted address space of a single 
process.

Various mechanisms are provided to annotate different types of memory (static, 
buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In 
addition, the ASI infrastructure takes care to ensure that different classes of 
memory do not share the same physical page. This includes separating sensitive, 
globally non-sensitive and locally non-sensitive memory into different pages, 
as well as separating the locally non-sensitive memory of different processes 
into different pages.

What exactly should be considered non-sensitive (either globally or locally) is 
somewhat open-ended. Some things are clearly sensitive or non-sensitive, but 
many things also fall into a gray area, depending on how paranoid one wants to 
be. For this proof of concept, we have generally treated such things as 
non-sensitive, though that may not necessarily be the ideal classification in 
each case. Similarly, there is also a gray area between globally and locally 
non-sensitive classifications in some cases, and in those cases this RFC has 
mostly erred on the side of marking them as locally non-sensitive, even though 
many of those cases could likely be safely classified as globally non-sensitive.

Although this implementation includes fairly extensive support for marking most 
types of dynamically allocated memory as locally non-sensitive, it may be 
feasible, at least for KVM-ASI, to get away with a simpler implementation (such 
as [5]) if we are very selective about what memory we treat as locally 
non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more 
general mechanism is included in this proof of concept as an illustration of 
what could be done if we really needed to treat arbitrary kernel memory as 
locally non-sensitive.

It is also possible to have ASI classes that do not utilize the above described 
infrastructure and instead manage all the memory mappings inside the restricted 
address space on their own.


References
==========
[1] https://lore.kernel.org/lkml/1557758315-12667-1-git-send-email-alexandre.chartre@oracle.com
[2] https://lore.kernel.org/lkml/1562855138-19507-1-git-send-email-alexandre.chartre@oracle.com
[3] https://lore.kernel.org/lkml/1582734120-26757-1-git-send-email-alexandre.chartre@oracle.com
[4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@oracle.com
[5] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@amazon.de
[6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@joelfernandes.org

Cc: Paul Turner <pjt@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>


Junaid Shahid (32):
  mm: asi: Introduce ASI core API
  mm: asi: Add command-line parameter to enable/disable ASI
  mm: asi: Switch to unrestricted address space when entering scheduler
  mm: asi: ASI support in interrupts/exceptions
  mm: asi: Make __get_current_cr3_fast() ASI-aware
  mm: asi: ASI page table allocation and free functions
  mm: asi: Functions to map/unmap a memory range into ASI page tables
  mm: asi: Add basic infrastructure for global non-sensitive mappings
  mm: Add __PAGEFLAG_FALSE
  mm: asi: Support for global non-sensitive direct map allocations
  mm: asi: Global non-sensitive vmalloc/vmap support
  mm: asi: Support for global non-sensitive slab caches
  mm: asi: Disable ASI API when ASI is not enabled for a process
  kvm: asi: Restricted address space for VM execution
  mm: asi: Support for mapping non-sensitive pcpu chunks
  mm: asi: Aliased direct map for local non-sensitive allocations
  mm: asi: Support for pre-ASI-init local non-sensitive allocations
  mm: asi: Support for locally nonsensitive page allocations
  mm: asi: Support for locally non-sensitive vmalloc allocations
  mm: asi: Add support for locally non-sensitive VM_USERMAP pages
  mm: asi: Add support for mapping all userspace memory into ASI
  mm: asi: Support for local non-sensitive slab caches
  mm: asi: Avoid warning from NMI userspace accesses in ASI context
  mm: asi: Use separate PCIDs for restricted address spaces
  mm: asi: Avoid TLB flushes during ASI CR3 switches when possible
  mm: asi: Avoid TLB flush IPIs to CPUs not in ASI context
  mm: asi: Reduce TLB flushes when freeing pages asynchronously
  mm: asi: Add API for mapping userspace address ranges
  mm: asi: Support for non-sensitive SLUB caches
  x86: asi: Allocate FPU state separately when ASI is enabled.
  kvm: asi: Map guest memory into restricted ASI address space
  kvm: asi: Unmap guest memory from ASI address space when using nested
    virt

Ofir Weisse (15):
  asi: Added ASI memory cgroup flag
  mm: asi: Added refcounting when initilizing an asi
  mm: asi: asi_exit() on PF, skip handling if address is accessible
  mm: asi: Adding support for dynamic percpu ASI allocations
  mm: asi: ASI annotation support for static variables.
  mm: asi: ASI annotation support for dynamic modules.
  mm: asi: Skip conventional L1TF/MDS mitigations
  mm: asi: support for static percpu DEFINE_PER_CPU*_ASI
  mm: asi: Annotation of static variables to be nonsensitive
  mm: asi: Annotation of PERCPU variables to be nonsensitive
  mm: asi: Annotation of dynamic variables to be nonsensitive
  kvm: asi: Splitting kvm_vcpu_arch into non/sensitive parts
  mm: asi: Mapping global nonsensitive areas in asi_global_init
  kvm: asi: Do asi_exit() in vcpu_run loop before returning to userspace
  mm: asi: Properly un/mapping task stack from ASI + tlb flush

 arch/alpha/include/asm/Kbuild            |    1 +
 arch/arc/include/asm/Kbuild              |    1 +
 arch/arm/include/asm/Kbuild              |    1 +
 arch/arm64/include/asm/Kbuild            |    1 +
 arch/csky/include/asm/Kbuild             |    1 +
 arch/h8300/include/asm/Kbuild            |    1 +
 arch/hexagon/include/asm/Kbuild          |    1 +
 arch/ia64/include/asm/Kbuild             |    1 +
 arch/m68k/include/asm/Kbuild             |    1 +
 arch/microblaze/include/asm/Kbuild       |    1 +
 arch/mips/include/asm/Kbuild             |    1 +
 arch/nds32/include/asm/Kbuild            |    1 +
 arch/nios2/include/asm/Kbuild            |    1 +
 arch/openrisc/include/asm/Kbuild         |    1 +
 arch/parisc/include/asm/Kbuild           |    1 +
 arch/powerpc/include/asm/Kbuild          |    1 +
 arch/riscv/include/asm/Kbuild            |    1 +
 arch/s390/include/asm/Kbuild             |    1 +
 arch/sh/include/asm/Kbuild               |    1 +
 arch/sparc/include/asm/Kbuild            |    1 +
 arch/um/include/asm/Kbuild               |    1 +
 arch/x86/events/core.c                   |    6 +-
 arch/x86/events/intel/bts.c              |    2 +-
 arch/x86/events/intel/core.c             |    2 +-
 arch/x86/events/msr.c                    |    2 +-
 arch/x86/events/perf_event.h             |    4 +-
 arch/x86/include/asm/asi.h               |  215 ++++
 arch/x86/include/asm/cpufeatures.h       |    1 +
 arch/x86/include/asm/current.h           |    2 +-
 arch/x86/include/asm/debugreg.h          |    2 +-
 arch/x86/include/asm/desc.h              |    2 +-
 arch/x86/include/asm/disabled-features.h |    8 +-
 arch/x86/include/asm/fpu/api.h           |    3 +-
 arch/x86/include/asm/hardirq.h           |    2 +-
 arch/x86/include/asm/hw_irq.h            |    2 +-
 arch/x86/include/asm/idtentry.h          |   25 +-
 arch/x86/include/asm/kvm_host.h          |  124 +-
 arch/x86/include/asm/page.h              |   19 +-
 arch/x86/include/asm/page_64.h           |   27 +-
 arch/x86/include/asm/page_64_types.h     |   20 +
 arch/x86/include/asm/percpu.h            |    2 +-
 arch/x86/include/asm/pgtable_64_types.h  |   10 +
 arch/x86/include/asm/preempt.h           |    2 +-
 arch/x86/include/asm/processor.h         |   17 +-
 arch/x86/include/asm/smp.h               |    2 +-
 arch/x86/include/asm/tlbflush.h          |   49 +-
 arch/x86/include/asm/topology.h          |    2 +-
 arch/x86/kernel/alternative.c            |    2 +-
 arch/x86/kernel/apic/apic.c              |    2 +-
 arch/x86/kernel/apic/x2apic_cluster.c    |    8 +-
 arch/x86/kernel/cpu/bugs.c               |    2 +-
 arch/x86/kernel/cpu/common.c             |   12 +-
 arch/x86/kernel/e820.c                   |    7 +-
 arch/x86/kernel/fpu/core.c               |   47 +-
 arch/x86/kernel/fpu/init.c               |    7 +-
 arch/x86/kernel/fpu/internal.h           |    1 +
 arch/x86/kernel/fpu/xstate.c             |   21 +-
 arch/x86/kernel/head_64.S                |   12 +
 arch/x86/kernel/hw_breakpoint.c          |    2 +-
 arch/x86/kernel/irq.c                    |    2 +-
 arch/x86/kernel/irqinit.c                |    2 +-
 arch/x86/kernel/nmi.c                    |    6 +-
 arch/x86/kernel/process.c                |   13 +-
 arch/x86/kernel/setup.c                  |    4 +-
 arch/x86/kernel/setup_percpu.c           |    4 +-
 arch/x86/kernel/smp.c                    |    2 +-
 arch/x86/kernel/smpboot.c                |    3 +-
 arch/x86/kernel/traps.c                  |    2 +
 arch/x86/kernel/tsc.c                    |   10 +-
 arch/x86/kernel/vmlinux.lds.S            |    2 +-
 arch/x86/kvm/cpuid.c                     |   18 +-
 arch/x86/kvm/kvm_cache_regs.h            |   22 +-
 arch/x86/kvm/lapic.c                     |   11 +-
 arch/x86/kvm/mmu.h                       |   16 +-
 arch/x86/kvm/mmu/mmu.c                   |  209 ++--
 arch/x86/kvm/mmu/mmu_internal.h          |    2 +-
 arch/x86/kvm/mmu/paging_tmpl.h           |   40 +-
 arch/x86/kvm/mmu/spte.c                  |    6 +-
 arch/x86/kvm/mmu/spte.h                  |    2 +-
 arch/x86/kvm/mmu/tdp_mmu.c               |   14 +-
 arch/x86/kvm/mtrr.c                      |    2 +-
 arch/x86/kvm/svm/nested.c                |   34 +-
 arch/x86/kvm/svm/sev.c                   |   70 +-
 arch/x86/kvm/svm/svm.c                   |   52 +-
 arch/x86/kvm/trace.h                     |   10 +-
 arch/x86/kvm/vmx/capabilities.h          |   14 +-
 arch/x86/kvm/vmx/nested.c                |   90 +-
 arch/x86/kvm/vmx/vmx.c                   |  152 ++-
 arch/x86/kvm/x86.c                       |  315 +++--
 arch/x86/kvm/x86.h                       |    4 +-
 arch/x86/mm/Makefile                     |    1 +
 arch/x86/mm/asi.c                        | 1397 ++++++++++++++++++++++
 arch/x86/mm/fault.c                      |   67 +-
 arch/x86/mm/init.c                       |    7 +-
 arch/x86/mm/init_64.c                    |   26 +-
 arch/x86/mm/kaslr.c                      |   34 +-
 arch/x86/mm/mm_internal.h                |    5 +
 arch/x86/mm/physaddr.c                   |    8 +
 arch/x86/mm/tlb.c                        |  419 ++++++-
 arch/xtensa/include/asm/Kbuild           |    1 +
 fs/binfmt_elf.c                          |    2 +-
 fs/eventfd.c                             |    2 +-
 fs/eventpoll.c                           |   10 +-
 fs/exec.c                                |    7 +
 fs/file.c                                |    3 +-
 fs/timerfd.c                             |    2 +-
 include/asm-generic/asi.h                |  149 +++
 include/asm-generic/irq_regs.h           |    2 +-
 include/asm-generic/percpu.h             |    6 +
 include/asm-generic/vmlinux.lds.h        |   36 +-
 include/linux/arch_topology.h            |    2 +-
 include/linux/debug_locks.h              |    4 +-
 include/linux/gfp.h                      |   13 +-
 include/linux/hrtimer.h                  |    2 +-
 include/linux/interrupt.h                |    2 +-
 include/linux/jiffies.h                  |    4 +-
 include/linux/kernel_stat.h              |    4 +-
 include/linux/kvm_host.h                 |    7 +-
 include/linux/kvm_types.h                |    3 +
 include/linux/memcontrol.h               |    3 +
 include/linux/mm_types.h                 |   59 +
 include/linux/module.h                   |   15 +
 include/linux/notifier.h                 |    2 +-
 include/linux/page-flags.h               |   19 +
 include/linux/percpu-defs.h              |   39 +
 include/linux/percpu.h                   |    8 +-
 include/linux/pgtable.h                  |    3 +
 include/linux/prandom.h                  |    2 +-
 include/linux/profile.h                  |    2 +-
 include/linux/rcupdate.h                 |    4 +-
 include/linux/rcutree.h                  |    2 +-
 include/linux/sched.h                    |    5 +
 include/linux/sched/mm.h                 |   12 +
 include/linux/sched/sysctl.h             |    1 +
 include/linux/slab.h                     |   68 +-
 include/linux/slab_def.h                 |    4 +
 include/linux/slub_def.h                 |    6 +
 include/linux/vmalloc.h                  |   16 +-
 include/trace/events/mmflags.h           |   14 +-
 init/main.c                              |    2 +-
 kernel/cgroup/cgroup.c                   |    9 +-
 kernel/cpu.c                             |   14 +-
 kernel/entry/common.c                    |    6 +
 kernel/events/core.c                     |   25 +-
 kernel/exit.c                            |    2 +
 kernel/fork.c                            |   69 +-
 kernel/freezer.c                         |    2 +-
 kernel/irq_work.c                        |    6 +-
 kernel/locking/lockdep.c                 |   14 +-
 kernel/module-internal.h                 |    1 +
 kernel/module.c                          |  210 +++-
 kernel/panic.c                           |    2 +-
 kernel/printk/printk.c                   |    4 +-
 kernel/profile.c                         |    4 +-
 kernel/rcu/srcutree.c                    |    3 +-
 kernel/rcu/tree.c                        |   12 +-
 kernel/rcu/update.c                      |    4 +-
 kernel/sched/clock.c                     |    2 +-
 kernel/sched/core.c                      |   23 +-
 kernel/sched/cpuacct.c                   |   10 +-
 kernel/sched/cpufreq.c                   |    3 +-
 kernel/sched/cputime.c                   |    4 +-
 kernel/sched/fair.c                      |    7 +-
 kernel/sched/loadavg.c                   |    2 +-
 kernel/sched/rt.c                        |    2 +-
 kernel/sched/sched.h                     |   25 +-
 kernel/sched/topology.c                  |   28 +-
 kernel/smp.c                             |   26 +-
 kernel/softirq.c                         |    5 +-
 kernel/time/hrtimer.c                    |    4 +-
 kernel/time/jiffies.c                    |    8 +-
 kernel/time/ntp.c                        |   30 +-
 kernel/time/tick-common.c                |    6 +-
 kernel/time/tick-internal.h              |    6 +-
 kernel/time/tick-sched.c                 |    4 +-
 kernel/time/timekeeping.c                |   10 +-
 kernel/time/timekeeping.h                |    2 +-
 kernel/time/timer.c                      |    4 +-
 kernel/trace/ring_buffer.c               |    5 +-
 kernel/trace/trace.c                     |    4 +-
 kernel/trace/trace_preemptirq.c          |    2 +-
 kernel/trace/trace_sched_switch.c        |    4 +-
 kernel/tracepoint.c                      |    2 +-
 kernel/watchdog.c                        |   12 +-
 lib/debug_locks.c                        |    5 +-
 lib/irq_regs.c                           |    2 +-
 lib/radix-tree.c                         |    6 +-
 lib/random32.c                           |    3 +-
 mm/init-mm.c                             |    2 +
 mm/internal.h                            |    3 +
 mm/memcontrol.c                          |   37 +-
 mm/memory.c                              |    4 +-
 mm/page_alloc.c                          |  204 +++-
 mm/percpu-internal.h                     |   23 +-
 mm/percpu-km.c                           |    5 +-
 mm/percpu-vm.c                           |   57 +-
 mm/percpu.c                              |  273 ++++-
 mm/slab.c                                |   42 +-
 mm/slab.h                                |  166 ++-
 mm/slab_common.c                         |  461 ++++++-
 mm/slub.c                                |  140 ++-
 mm/sparse.c                              |    4 +-
 mm/util.c                                |    3 +-
 mm/vmalloc.c                             |  193 ++-
 net/core/skbuff.c                        |    2 +-
 net/core/sock.c                          |    2 +-
 security/Kconfig                         |   12 +
 tools/perf/builtin-kmem.c                |    2 +
 virt/kvm/coalesced_mmio.c                |    2 +-
 virt/kvm/eventfd.c                       |    5 +-
 virt/kvm/kvm_main.c                      |   61 +-
 211 files changed, 5727 insertions(+), 959 deletions(-)
 create mode 100644 arch/x86/include/asm/asi.h
 create mode 100644 arch/x86/mm/asi.c
 create mode 100644 include/asm-generic/asi.h

Comments

Hyeonggon Yoo March 5, 2022, 3:39 a.m. UTC | #1
On Tue, Feb 22, 2022 at 09:21:36PM -0800, Junaid Shahid wrote:
> This patch series is a proof-of-concept RFC for an end-to-end implementation of 
> Address Space Isolation for KVM.
> [...]

[+Cc slab maintainers/reviewers]

Please Cc relevant people.
Patches 14, 24 and 31 need to be reviewed by slab people :)

> Background
> ==========
> Address Space Isolation is a comprehensive security mitigation for several types 
> of speculative execution attacks. Even though the kernel already has several 
> speculative execution vulnerability mitigations, some of them can be quite 
> expensive if enabled fully e.g. to fully mitigate L1TF using the existing 
> mechanisms requires doing an L1 cache flush on every single VM entry as well as 
> disabling hyperthreading altogether. (Although core scheduling can provide some 
> protection when hyperthreading is enabled, it is not sufficient by itself to 
> protect against all leaks unless sibling hyperthread stunning is also performed 
> on every VM exit.) ASI provides a much less expensive mitigation for such 
> vulnerabilities while still providing an almost similar level of protection.
> 
> There are a couple of basic insights/assumptions behind ASI:
> 
> 1. Most execution paths in the kernel (especially during virtual machine 
> execution) access only memory that is not particularly sensitive even if it were 
> to get leaked to the executing process/VM (setting aside for a moment what 
> exactly should be considered sensitive or non-sensitive).
> 2. Even when executing speculatively, the CPU can generally only bring memory 
> that is mapped in the current page tables into its various caches and internal 
> buffers.
> 
> Given these, the idea of using ASI to thwart speculative attacks is that we can 
> execute the kernel using a restricted set of page tables most of the time and 
> switch to the full unrestricted kernel address space only when the kernel needs 
> to access something that is not mapped in the restricted address space. And we 
> keep track of when a switch to the full kernel address space is done, so that 
> before returning back to the process/VM, we can switch back to the restricted 
> address space. In the paths where the kernel is able to execute entirely while 
> remaining in the restricted address space, we can skip other mitigations for 
> speculative execution attacks (such as L1 cache / micro-arch buffer flushes, 
> sibling hyperthread stunning etc.). Only in the cases where we do end up 
> switching the page tables, we perform these more expensive mitigations. Assuming 
> that happens relatively infrequently, the performance can be significantly 
> better compared to performing these mitigations all the time.
> 
> Please note that although we do have a sibling hyperthread stunning 
> implementation internally, which is fully integrated with KVM-ASI, it is not 
> included in this RFC for the time being. The earlier upstream proposal for 
> sibling stunning [6] could potentially be integrated into an upstream ASI 
> implementation.
> 
> Basic concepts
> ==============
> Different types of restricted address spaces are represented by different ASI 
> classes. For instance, KVM-ASI is an ASI class used during VM execution. KPTI 
> would be another ASI class. An ASI instance (struct asi) represents a single 
> restricted address space. There is a separate ASI instance for each untrusted 
> context (e.g. a userspace process, a VM, or even a single VCPU etc.) Note that 
> there can be multiple untrusted security contexts (and thus multiple restricted 
> address spaces) within a single process e.g. in the case of VMs, the userspace 
> process is a different security context than the guest VM, and in principle, 
> even each VCPU could be considered a separate security context (That would be 
> primarily useful for securing nested virtualization).
> 
> In this RFC, a process can have at most one ASI instance of each class, though 
> this is not an inherent limitation and multiple instances of the same class 
> should eventually be supported. (A process can still have ASI instances of 
> different classes e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even 
> entirely necessary to tie an ASI instance to a process. That is just a 
> simplification for the initial implementation.
> 
> An asi_enter operation switches into the restricted address space represented by 
> the given ASI instance. An asi_exit operation switches to the full unrestricted 
> kernel address space. Each ASI class can provide hooks to be executed during 
> these operations, which can be used to perform speculative attack mitigations 
> relevant to that class. For instance, the KVM-ASI hooks would perform a 
> sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear 
> and sibling-hyperthread-unstun operations in the asi_enter hook. On the other 
> hand, the hooks for the KPTI class would be NO-OP, since the switching of the 
> page tables is enough mitigation in that case.
> 
> If the kernel attempts to access memory that is not mapped in the currently 
> active ASI instance, the page fault handler automatically performs an asi_exit 
> operation. This means that except for a few critical pieces of memory, leaving 
> something out of a restricted address space will result in only a performance 
> hit, rather than a catastrophic failure. The kernel can also perform explicit 
> asi_exit operations in some paths as needed.
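The implicit-exit behavior described above can be modeled in a few lines. This is a userspace sketch, not the in-tree fault handler: the page maps are toy arrays and `asi_page_fault` is an illustrative name.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* Full kernel tables; page 7 is deliberately unmapped everywhere. */
static bool full_map[NPAGES] = {
	true, true, true, true, true, true, true, false
};
/* Restricted tables: always a subset of full_map. */
static bool restricted_map[NPAGES] = { true, true, false };

static bool in_restricted = true;

enum fault_result { FAULT_RETRY, FAULT_OOPS };

static enum fault_result asi_page_fault(int page)
{
	/*
	 * An "ASI fault": the page is absent from the restricted tables
	 * but present in the full kernel tables. The handler performs an
	 * implicit asi_exit and retries, so the cost is only a page fault.
	 */
	if (in_restricted && !restricted_map[page] && full_map[page]) {
		in_restricted = false;	/* implicit asi_exit */
		return FAULT_RETRY;
	}
	return FAULT_OOPS;		/* genuinely bad access */
}
```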
> 
> Apart from the page fault handler, other exceptions and interrupts (even NMIs) 
> do not automatically cause an asi_exit and could potentially be executed 
> completely within a restricted address space if they don't end up accessing any 
> sensitive piece of memory.
> 
> The mappings within a restricted address space are always a subset of the full 
> kernel address space and each mapping is always the same as the corresponding 
> mapping in the full kernel address space. This is necessary because we could 
> potentially end up performing an asi_exit at any point.
> 
> Although this RFC only includes an implementation of the KVM-ASI class, a KPTI 
> class could also be implemented on top of the same infrastructure. Furthermore, 
> in the future we could also implement a KPTI-Next class that actually uses the 
> ASI model for userspace processes i.e. mapping non-sensitive kernel memory in 
> the restricted address space and trying to execute most syscalls/interrupts 
> without switching to the full kernel address space, as opposed to the current 
> KPTI which requires an address space switch on every kernel/user mode 
> transition.
> 
> Memory classification
> =====================
> We divide memory into three categories.
> 
> 1. Sensitive memory
> This is memory that should never get leaked to any process or VM. Sensitive 
> memory is only mapped in the unrestricted kernel page tables. By default, all 
> memory is considered sensitive unless specifically categorized otherwise.
> 
> 2. Globally non-sensitive memory
> This is memory that does not present a substantial security threat even if it 
> were to get leaked to any process or VM in the system. Globally non-sensitive 
> memory is mapped in the restricted address spaces for all processes.
> 
> 3. Locally non-sensitive memory
> This is memory that does not present a substantial security threat if it were to 
> get leaked to the currently running process or VM, but would present a security 
> issue if it were to get leaked to any other process or VM in the system. 
> Examples include userspace memory (or guest memory in the case of VMs) or kernel 
> structures containing userspace/guest register context etc. Locally 
> non-sensitive memory is mapped only in the restricted address space of a single 
> process.
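The three categories above reduce to a simple mapping rule. The sketch below encodes it with illustrative types (not the RFC's actual data structures): a page's class, plus its owner for the locally non-sensitive case, determines whether it appears in a given restricted address space.

```c
#include <assert.h>
#include <stdbool.h>

enum asi_mem_class {
	ASI_SENSITIVE,			/* unrestricted kernel tables only */
	ASI_GLOBAL_NONSENSITIVE,	/* every restricted address space */
	ASI_LOCAL_NONSENSITIVE,		/* owner's restricted space only */
};

/*
 * Is a page of class 'cls', owned by process 'owner_pid', mapped in the
 * restricted address space belonging to process 'asi_owner_pid'?
 */
static bool mapped_in_restricted(enum asi_mem_class cls,
				 int owner_pid, int asi_owner_pid)
{
	switch (cls) {
	case ASI_GLOBAL_NONSENSITIVE:
		return true;
	case ASI_LOCAL_NONSENSITIVE:
		return owner_pid == asi_owner_pid;
	case ASI_SENSITIVE:
	default:
		return false;
	}
}
```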
> 
> Various mechanisms are provided to annotate different types of memory (static, 
> buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In 
> addition, the ASI infrastructure takes care to ensure that different classes of 
> memory do not share the same physical page. This includes separation of 
> sensitive, globally non-sensitive and locally non-sensitive memory into 
> different pages, and separation of locally non-sensitive memory for 
> different processes into different pages.
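The page-separation invariant above can be sketched as a toy allocator rule. This is purely illustrative (not the actual buddy/slab changes): a physical page is tagged on first use, and an allocation of a different class, or of a different owner for locally non-sensitive memory, must come from a different page.

```c
#include <assert.h>
#include <stdbool.h>

#define ASI_LOCAL_NONSENSITIVE 2	/* toy class id; owner matters */

struct page_tag { int cls; int owner; bool used; };

static struct page_tag pages[4];

/* May an allocation of (cls, owner) share this physical page? */
static bool page_compatible(const struct page_tag *pg, int cls, int owner)
{
	if (!pg->used)
		return true;		/* fresh page: any class may claim it */
	if (pg->cls != cls)
		return false;		/* never mix classes on one page */
	/* locally non-sensitive pages are additionally per-owner */
	return cls != ASI_LOCAL_NONSENSITIVE || pg->owner == owner;
}

static int alloc_page_for(int cls, int owner)
{
	for (int i = 0; i < 4; i++) {
		if (page_compatible(&pages[i], cls, owner)) {
			pages[i] = (struct page_tag){
				.cls = cls, .owner = owner, .used = true,
			};
			return i;
		}
	}
	return -1;			/* no compatible page available */
}
```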
> 
> What exactly should be considered non-sensitive (either globally or locally) is 
> somewhat open-ended. Some things are clearly sensitive or non-sensitive, but 
> many things also fall into a gray area, depending on how paranoid one wants to 
> be. For this proof of concept, we have generally treated such things as 
> non-sensitive, though that may not necessarily be the ideal classification in 
> each case. Similarly, there is also a gray area between globally and locally 
> non-sensitive classifications in some cases, and in those cases this RFC has 
> mostly erred on the side of marking them as locally non-sensitive, even though 
> many of those cases could likely be safely classified as globally non-sensitive.
> 
> Although this implementation includes fairly extensive support for marking most 
> types of dynamically allocated memory as locally non-sensitive, it may be 
> feasible, at least for KVM-ASI, to get away with a simpler implementation (such 
> as [5]), if we are very selective about what memory we treat as locally 
> non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more 
> general mechanism is included in this proof of concept as an illustration for 
> what could be done if we really needed to treat any arbitrary kernel memory as 
> locally non-sensitive.
> 
> It is also possible to have ASI classes that do not utilize the above described 
> infrastructure and instead manage all the memory mappings inside the restricted 
> address space on their own.
> 
> 
> References
> ==========
> [1] https://lore.kernel.org/lkml/1557758315-12667-1-git-send-email-alexandre.chartre@oracle.com
> [2] https://lore.kernel.org/lkml/1562855138-19507-1-git-send-email-alexandre.chartre@oracle.com
> [3] https://lore.kernel.org/lkml/1582734120-26757-1-git-send-email-alexandre.chartre@oracle.com
> [4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@oracle.com
> [5] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@amazon.de
> [6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@joelfernandes.org
> 
> Cc: Paul Turner <pjt@google.com>
> Cc: Jim Mattson <jmattson@google.com>
> Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andy Lutomirski <luto@kernel.org>
> 
> 
> Junaid Shahid (32):
>   mm: asi: Introduce ASI core API
>   mm: asi: Add command-line parameter to enable/disable ASI
>   mm: asi: Switch to unrestricted address space when entering scheduler
>   mm: asi: ASI support in interrupts/exceptions
>   mm: asi: Make __get_current_cr3_fast() ASI-aware
>   mm: asi: ASI page table allocation and free functions
>   mm: asi: Functions to map/unmap a memory range into ASI page tables
>   mm: asi: Add basic infrastructure for global non-sensitive mappings
>   mm: Add __PAGEFLAG_FALSE
>   mm: asi: Support for global non-sensitive direct map allocations
>   mm: asi: Global non-sensitive vmalloc/vmap support
>   mm: asi: Support for global non-sensitive slab caches
>   mm: asi: Disable ASI API when ASI is not enabled for a process
>   kvm: asi: Restricted address space for VM execution
>   mm: asi: Support for mapping non-sensitive pcpu chunks
>   mm: asi: Aliased direct map for local non-sensitive allocations
>   mm: asi: Support for pre-ASI-init local non-sensitive allocations
>   mm: asi: Support for locally nonsensitive page allocations
>   mm: asi: Support for locally non-sensitive vmalloc allocations
>   mm: asi: Add support for locally non-sensitive VM_USERMAP pages
>   mm: asi: Add support for mapping all userspace memory into ASI
>   mm: asi: Support for local non-sensitive slab caches
>   mm: asi: Avoid warning from NMI userspace accesses in ASI context
>   mm: asi: Use separate PCIDs for restricted address spaces
>   mm: asi: Avoid TLB flushes during ASI CR3 switches when possible
>   mm: asi: Avoid TLB flush IPIs to CPUs not in ASI context
>   mm: asi: Reduce TLB flushes when freeing pages asynchronously
>   mm: asi: Add API for mapping userspace address ranges
>   mm: asi: Support for non-sensitive SLUB caches
>   x86: asi: Allocate FPU state separately when ASI is enabled.
>   kvm: asi: Map guest memory into restricted ASI address space
>   kvm: asi: Unmap guest memory from ASI address space when using nested
>     virt
> 
> Ofir Weisse (15):
>   asi: Added ASI memory cgroup flag
>   mm: asi: Added refcounting when initilizing an asi
>   mm: asi: asi_exit() on PF, skip handling if address is accessible
>   mm: asi: Adding support for dynamic percpu ASI allocations
>   mm: asi: ASI annotation support for static variables.
>   mm: asi: ASI annotation support for dynamic modules.
>   mm: asi: Skip conventional L1TF/MDS mitigations
>   mm: asi: support for static percpu DEFINE_PER_CPU*_ASI
>   mm: asi: Annotation of static variables to be nonsensitive
>   mm: asi: Annotation of PERCPU variables to be nonsensitive
>   mm: asi: Annotation of dynamic variables to be nonsensitive
>   kvm: asi: Splitting kvm_vcpu_arch into non/sensitive parts
>   mm: asi: Mapping global nonsensitive areas in asi_global_init
>   kvm: asi: Do asi_exit() in vcpu_run loop before returning to userspace
>   mm: asi: Properly un/mapping task stack from ASI + tlb flush
> 
>  arch/alpha/include/asm/Kbuild            |    1 +
>  arch/arc/include/asm/Kbuild              |    1 +
>  arch/arm/include/asm/Kbuild              |    1 +
>  arch/arm64/include/asm/Kbuild            |    1 +
>  arch/csky/include/asm/Kbuild             |    1 +
>  arch/h8300/include/asm/Kbuild            |    1 +
>  arch/hexagon/include/asm/Kbuild          |    1 +
>  arch/ia64/include/asm/Kbuild             |    1 +
>  arch/m68k/include/asm/Kbuild             |    1 +
>  arch/microblaze/include/asm/Kbuild       |    1 +
>  arch/mips/include/asm/Kbuild             |    1 +
>  arch/nds32/include/asm/Kbuild            |    1 +
>  arch/nios2/include/asm/Kbuild            |    1 +
>  arch/openrisc/include/asm/Kbuild         |    1 +
>  arch/parisc/include/asm/Kbuild           |    1 +
>  arch/powerpc/include/asm/Kbuild          |    1 +
>  arch/riscv/include/asm/Kbuild            |    1 +
>  arch/s390/include/asm/Kbuild             |    1 +
>  arch/sh/include/asm/Kbuild               |    1 +
>  arch/sparc/include/asm/Kbuild            |    1 +
>  arch/um/include/asm/Kbuild               |    1 +
>  arch/x86/events/core.c                   |    6 +-
>  arch/x86/events/intel/bts.c              |    2 +-
>  arch/x86/events/intel/core.c             |    2 +-
>  arch/x86/events/msr.c                    |    2 +-
>  arch/x86/events/perf_event.h             |    4 +-
>  arch/x86/include/asm/asi.h               |  215 ++++
>  arch/x86/include/asm/cpufeatures.h       |    1 +
>  arch/x86/include/asm/current.h           |    2 +-
>  arch/x86/include/asm/debugreg.h          |    2 +-
>  arch/x86/include/asm/desc.h              |    2 +-
>  arch/x86/include/asm/disabled-features.h |    8 +-
>  arch/x86/include/asm/fpu/api.h           |    3 +-
>  arch/x86/include/asm/hardirq.h           |    2 +-
>  arch/x86/include/asm/hw_irq.h            |    2 +-
>  arch/x86/include/asm/idtentry.h          |   25 +-
>  arch/x86/include/asm/kvm_host.h          |  124 +-
>  arch/x86/include/asm/page.h              |   19 +-
>  arch/x86/include/asm/page_64.h           |   27 +-
>  arch/x86/include/asm/page_64_types.h     |   20 +
>  arch/x86/include/asm/percpu.h            |    2 +-
>  arch/x86/include/asm/pgtable_64_types.h  |   10 +
>  arch/x86/include/asm/preempt.h           |    2 +-
>  arch/x86/include/asm/processor.h         |   17 +-
>  arch/x86/include/asm/smp.h               |    2 +-
>  arch/x86/include/asm/tlbflush.h          |   49 +-
>  arch/x86/include/asm/topology.h          |    2 +-
>  arch/x86/kernel/alternative.c            |    2 +-
>  arch/x86/kernel/apic/apic.c              |    2 +-
>  arch/x86/kernel/apic/x2apic_cluster.c    |    8 +-
>  arch/x86/kernel/cpu/bugs.c               |    2 +-
>  arch/x86/kernel/cpu/common.c             |   12 +-
>  arch/x86/kernel/e820.c                   |    7 +-
>  arch/x86/kernel/fpu/core.c               |   47 +-
>  arch/x86/kernel/fpu/init.c               |    7 +-
>  arch/x86/kernel/fpu/internal.h           |    1 +
>  arch/x86/kernel/fpu/xstate.c             |   21 +-
>  arch/x86/kernel/head_64.S                |   12 +
>  arch/x86/kernel/hw_breakpoint.c          |    2 +-
>  arch/x86/kernel/irq.c                    |    2 +-
>  arch/x86/kernel/irqinit.c                |    2 +-
>  arch/x86/kernel/nmi.c                    |    6 +-
>  arch/x86/kernel/process.c                |   13 +-
>  arch/x86/kernel/setup.c                  |    4 +-
>  arch/x86/kernel/setup_percpu.c           |    4 +-
>  arch/x86/kernel/smp.c                    |    2 +-
>  arch/x86/kernel/smpboot.c                |    3 +-
>  arch/x86/kernel/traps.c                  |    2 +
>  arch/x86/kernel/tsc.c                    |   10 +-
>  arch/x86/kernel/vmlinux.lds.S            |    2 +-
>  arch/x86/kvm/cpuid.c                     |   18 +-
>  arch/x86/kvm/kvm_cache_regs.h            |   22 +-
>  arch/x86/kvm/lapic.c                     |   11 +-
>  arch/x86/kvm/mmu.h                       |   16 +-
>  arch/x86/kvm/mmu/mmu.c                   |  209 ++--
>  arch/x86/kvm/mmu/mmu_internal.h          |    2 +-
>  arch/x86/kvm/mmu/paging_tmpl.h           |   40 +-
>  arch/x86/kvm/mmu/spte.c                  |    6 +-
>  arch/x86/kvm/mmu/spte.h                  |    2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c               |   14 +-
>  arch/x86/kvm/mtrr.c                      |    2 +-
>  arch/x86/kvm/svm/nested.c                |   34 +-
>  arch/x86/kvm/svm/sev.c                   |   70 +-
>  arch/x86/kvm/svm/svm.c                   |   52 +-
>  arch/x86/kvm/trace.h                     |   10 +-
>  arch/x86/kvm/vmx/capabilities.h          |   14 +-
>  arch/x86/kvm/vmx/nested.c                |   90 +-
>  arch/x86/kvm/vmx/vmx.c                   |  152 ++-
>  arch/x86/kvm/x86.c                       |  315 +++--
>  arch/x86/kvm/x86.h                       |    4 +-
>  arch/x86/mm/Makefile                     |    1 +
>  arch/x86/mm/asi.c                        | 1397 ++++++++++++++++++++++
>  arch/x86/mm/fault.c                      |   67 +-
>  arch/x86/mm/init.c                       |    7 +-
>  arch/x86/mm/init_64.c                    |   26 +-
>  arch/x86/mm/kaslr.c                      |   34 +-
>  arch/x86/mm/mm_internal.h                |    5 +
>  arch/x86/mm/physaddr.c                   |    8 +
>  arch/x86/mm/tlb.c                        |  419 ++++++-
>  arch/xtensa/include/asm/Kbuild           |    1 +
>  fs/binfmt_elf.c                          |    2 +-
>  fs/eventfd.c                             |    2 +-
>  fs/eventpoll.c                           |   10 +-
>  fs/exec.c                                |    7 +
>  fs/file.c                                |    3 +-
>  fs/timerfd.c                             |    2 +-
>  include/asm-generic/asi.h                |  149 +++
>  include/asm-generic/irq_regs.h           |    2 +-
>  include/asm-generic/percpu.h             |    6 +
>  include/asm-generic/vmlinux.lds.h        |   36 +-
>  include/linux/arch_topology.h            |    2 +-
>  include/linux/debug_locks.h              |    4 +-
>  include/linux/gfp.h                      |   13 +-
>  include/linux/hrtimer.h                  |    2 +-
>  include/linux/interrupt.h                |    2 +-
>  include/linux/jiffies.h                  |    4 +-
>  include/linux/kernel_stat.h              |    4 +-
>  include/linux/kvm_host.h                 |    7 +-
>  include/linux/kvm_types.h                |    3 +
>  include/linux/memcontrol.h               |    3 +
>  include/linux/mm_types.h                 |   59 +
>  include/linux/module.h                   |   15 +
>  include/linux/notifier.h                 |    2 +-
>  include/linux/page-flags.h               |   19 +
>  include/linux/percpu-defs.h              |   39 +
>  include/linux/percpu.h                   |    8 +-
>  include/linux/pgtable.h                  |    3 +
>  include/linux/prandom.h                  |    2 +-
>  include/linux/profile.h                  |    2 +-
>  include/linux/rcupdate.h                 |    4 +-
>  include/linux/rcutree.h                  |    2 +-
>  include/linux/sched.h                    |    5 +
>  include/linux/sched/mm.h                 |   12 +
>  include/linux/sched/sysctl.h             |    1 +
>  include/linux/slab.h                     |   68 +-
>  include/linux/slab_def.h                 |    4 +
>  include/linux/slub_def.h                 |    6 +
>  include/linux/vmalloc.h                  |   16 +-
>  include/trace/events/mmflags.h           |   14 +-
>  init/main.c                              |    2 +-
>  kernel/cgroup/cgroup.c                   |    9 +-
>  kernel/cpu.c                             |   14 +-
>  kernel/entry/common.c                    |    6 +
>  kernel/events/core.c                     |   25 +-
>  kernel/exit.c                            |    2 +
>  kernel/fork.c                            |   69 +-
>  kernel/freezer.c                         |    2 +-
>  kernel/irq_work.c                        |    6 +-
>  kernel/locking/lockdep.c                 |   14 +-
>  kernel/module-internal.h                 |    1 +
>  kernel/module.c                          |  210 +++-
>  kernel/panic.c                           |    2 +-
>  kernel/printk/printk.c                   |    4 +-
>  kernel/profile.c                         |    4 +-
>  kernel/rcu/srcutree.c                    |    3 +-
>  kernel/rcu/tree.c                        |   12 +-
>  kernel/rcu/update.c                      |    4 +-
>  kernel/sched/clock.c                     |    2 +-
>  kernel/sched/core.c                      |   23 +-
>  kernel/sched/cpuacct.c                   |   10 +-
>  kernel/sched/cpufreq.c                   |    3 +-
>  kernel/sched/cputime.c                   |    4 +-
>  kernel/sched/fair.c                      |    7 +-
>  kernel/sched/loadavg.c                   |    2 +-
>  kernel/sched/rt.c                        |    2 +-
>  kernel/sched/sched.h                     |   25 +-
>  kernel/sched/topology.c                  |   28 +-
>  kernel/smp.c                             |   26 +-
>  kernel/softirq.c                         |    5 +-
>  kernel/time/hrtimer.c                    |    4 +-
>  kernel/time/jiffies.c                    |    8 +-
>  kernel/time/ntp.c                        |   30 +-
>  kernel/time/tick-common.c                |    6 +-
>  kernel/time/tick-internal.h              |    6 +-
>  kernel/time/tick-sched.c                 |    4 +-
>  kernel/time/timekeeping.c                |   10 +-
>  kernel/time/timekeeping.h                |    2 +-
>  kernel/time/timer.c                      |    4 +-
>  kernel/trace/ring_buffer.c               |    5 +-
>  kernel/trace/trace.c                     |    4 +-
>  kernel/trace/trace_preemptirq.c          |    2 +-
>  kernel/trace/trace_sched_switch.c        |    4 +-
>  kernel/tracepoint.c                      |    2 +-
>  kernel/watchdog.c                        |   12 +-
>  lib/debug_locks.c                        |    5 +-
>  lib/irq_regs.c                           |    2 +-
>  lib/radix-tree.c                         |    6 +-
>  lib/random32.c                           |    3 +-
>  mm/init-mm.c                             |    2 +
>  mm/internal.h                            |    3 +
>  mm/memcontrol.c                          |   37 +-
>  mm/memory.c                              |    4 +-
>  mm/page_alloc.c                          |  204 +++-
>  mm/percpu-internal.h                     |   23 +-
>  mm/percpu-km.c                           |    5 +-
>  mm/percpu-vm.c                           |   57 +-
>  mm/percpu.c                              |  273 ++++-
>  mm/slab.c                                |   42 +-
>  mm/slab.h                                |  166 ++-
>  mm/slab_common.c                         |  461 ++++++-
>  mm/slub.c                                |  140 ++-
>  mm/sparse.c                              |    4 +-
>  mm/util.c                                |    3 +-
>  mm/vmalloc.c                             |  193 ++-
>  net/core/skbuff.c                        |    2 +-
>  net/core/sock.c                          |    2 +-
>  security/Kconfig                         |   12 +
>  tools/perf/builtin-kmem.c                |    2 +
>  virt/kvm/coalesced_mmio.c                |    2 +-
>  virt/kvm/eventfd.c                       |    5 +-
>  virt/kvm/kvm_main.c                      |   61 +-
>  211 files changed, 5727 insertions(+), 959 deletions(-)
>  create mode 100644 arch/x86/include/asm/asi.h
>  create mode 100644 arch/x86/mm/asi.c
>  create mode 100644 include/asm-generic/asi.h
> 
> -- 
> 2.35.1.473.g83b2b277ed-goog
> 
>
Alexandre Chartre March 16, 2022, 9:34 p.m. UTC | #2
Hi Junaid,

On 2/23/22 06:21, Junaid Shahid wrote:
> This patch series is a proof-of-concept RFC for an end-to-end implementation of
> Address Space Isolation for KVM. It has similar goals and a somewhat similar
> high-level design as the original ASI patches from Alexandre Chartre
> ([1],[2],[3],[4]), but with a different underlying implementation. This also
> includes several memory management changes to help with differentiating between
> sensitive and non-sensitive memory and mapping of non-sensitive memory into the
> ASI restricted address spaces.
> 
> This RFC is intended as a demonstration of what a full ASI implementation for
> KVM could look like, not necessarily as a direct proposal for what might
> eventually be merged. In particular, these patches do not yet implement KPTI on
> top of ASI, although the framework is generic enough to be able to support it.
> Similarly, these patches do not include non-sensitive annotations for data
> structures that did not get frequently accessed during execution of our test
> workloads, but the framework is designed such that new non-sensitive memory
> annotations can be added trivially.
> 
> The patches apply on top of Linux v5.16. These patches are also available via
> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.
>
Sorry for the late answer, and thanks for investigating possible ASI
implementations. I have to admit I put ASI on the back-burner for
a while, because I am more and more wondering whether the complexity of
ASI is worth the benefit, especially given the challenges of effectively
exploiting the flaws that ASI is expected to mitigate, in particular when VMs
are running on dedicated cpu cores, or when core scheduling is used.
So I have been looking at a simpler approach (see below, A
Possible Alternative to ASI).

But first, your implementation confirms that KVM-ASI can be broken up
into different parts: pagetable management, ASI core and sibling cpus
synchronization.

Pagetable Management
====================
For ASI, we need to build a pagetable with a subset of the kernel
pagetable mappings. Your solution is interesting as it provides
a broad solution and also works well with dynamic allocations (while
my approach of copying mappings had several limitations). The drawback
is the extent of your changes, which spread over all the mm code
(while the simple solution of copying mappings can be done with a few
self-contained independent functions).

ASI Core
========

KPTI
----
Implementing KPTI with ASI is possible, but it is not straightforward.
This requires some special handling, in particular in the assembly kernel
entry/exit code for syscalls, interrupts and exceptions (see ASI RFC v4 [4]
as an example), because we are also switching privilege level in addition
to switching the pagetable. So this might be something to consider early
in your implementation to ensure it is effectively compatible with KPTI.

Going beyond KPTI (with a KPTI-next) and trying to execute most
syscalls/interrupts without switching to the full kernel address space
is more challenging, because it would require many more kernel mappings
in the user pagetable, and this would basically defeat the purpose of
KPTI. You can refer to the discussions about the RFC to defer the CR3
switch to C code [7], which was an attempt to just reach the kernel
entry C code with a KPTI pagetable.

Interrupts/Exceptions
---------------------
As long as interrupts/exceptions are not expected to be processed with
ASI, it is probably better to explicitly exit ASI before processing an
interrupt/exception; otherwise each interrupt/exception incurs the extra
overhead of taking a page fault and then exiting ASI.

This is particularly true if you want KPTI to use ASI; in that case
the ASI exit will need to be done early in the interrupt and exception
assembly entry code.

ASI Hooks
---------
ASI hooks are certainly a good idea to perform specific actions on ASI
enter or exit. However, I am not sure they are the appropriate place for CPU
stunning with KVM-ASI. That's because CPU stunning doesn't need to be
done precisely when entering and exiting ASI, and it probably shouldn't be
done there: it should be done right before VMEnter and right after VMExit
(see below).
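The suggested placement can be sketched as a vcpu run loop. This is a userspace model with stub stun/unstun operations; all function names are illustrative, and VMEnter/guest execution/VMExit are collapsed into a counter.

```c
#include <assert.h>

static int sibling_stunned = 1;	/* start stunned: kernel side is running */
static int guest_entries;
static int entries_with_sibling_free;

static void stun_sibling(void)   { sibling_stunned = 1; }
static void unstun_sibling(void) { sibling_stunned = 0; }

static void vcpu_run_once(void)
{
	/* asi_enter() would happen here: restricted tables now active */
	unstun_sibling();	/* safe only once in the restricted space */

	if (!sibling_stunned)
		entries_with_sibling_free++;
	guest_entries++;	/* stands in for VMEnter..guest..VMExit */

	stun_sibling();		/* right after VMExit, before the exit path
				 * can touch any sensitive data */
}
```

The point is that the stun/unstun pair brackets the guest execution window itself, not the asi_enter/asi_exit transitions, which may happen at other times for other reasons.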

Sibling CPUs Synchronization
============================
KVM-ASI requires the synchronization of sibling CPUs from the same CPU
core so that when a VM is running then sibling CPUs are running with the
ASI associated with this VM (or an ASI compatible with the VM, depending
on how ASI is defined). That way the VM can only spy on data from ASI
and won't be able to access any sensitive data.

So, right before entering a VM, KVM should ensure that sibling CPUs are
using ASI. If a sibling CPU is not using ASI, then KVM can either wait for
that sibling to enter ASI, or force it to use ASI (or to become idle).
This behavior should be enforced as long as any sibling is running the
VM; once no sibling is running the VM, the siblings can run
any code (using ASI or not).

It would be interesting to see the code you use to achieve this, because
I don't get how this is achieved from the description of your sibling
hyperthread stun and unstun mechanism.

Note that this synchronization is critical for ASI to work, in particular
when entering the VM, we need to be absolutely sure that sibling CPUs are
effectively using ASI. The core scheduling sibling stunning code you
referenced [6] uses a mechanism which is fine for userspace synchronization
(the delivery of the IPI forces the sibling to immediately enter the kernel),
but this won't work for ASI, as the delivery of the IPI won't guarantee that
the sibling has entered ASI yet. I did some experiments that show that data
will leak if siblings are not perfectly synchronized.
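The required guarantee, a positive acknowledgment that the sibling has entered ASI rather than mere IPI delivery, can be modeled as a flag handshake. This is a userspace pthread/atomics sketch; the flag and function names are illustrative, and the IPI is stood in for by a store.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int asi_requested;	/* "IPI sent": sibling asked to enter ASI */
static atomic_int sibling_in_asi;	/* set by the sibling only once it IS in ASI */

static void *sibling_cpu(void *arg)
{
	(void)arg;
	while (!atomic_load(&asi_requested))
		;			/* sibling busy with unrelated work */
	/* ... the sibling's own asi_enter() would happen here ... */
	atomic_store(&sibling_in_asi, 1);	/* the positive ack */
	return NULL;
}

/* Returns only when it is provably safe to VMEnter. */
static void wait_for_sibling_before_vmenter(void)
{
	atomic_store(&asi_requested, 1);	/* stand-in for sending the IPI */
	while (!atomic_load(&sibling_in_asi))
		;	/* IPI delivery alone is NOT the ack; wait for the flag */
}
```

The design point is that the entering CPU never proceeds to VMEnter on the strength of the IPI alone; it spins until the sibling has published its own state transition.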

A Possible Alternative to ASI?
=============================
ASI prevents access to sensitive data by unmapping them. On the other
hand, the KVM code somewhat already identifies access to sensitive data
as part of the L1TF/MDS mitigation, and when KVM is about to access
sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
flushed before VMEnter).

With KVM knowing when it accesses sensitive data, I think we can provide
the same mitigation as ASI by simply allowing KVM code which doesn't
access sensitive data to be run concurrently with a VM. This can be done
by tagging the kernel thread when it enters KVM code which doesn't
access sensitive data, and untagging the thread right before it accesses
sensitive data. And when KVM is about to do a VMEnter, we synchronize
sibling CPUs so that they run threads with the same tag. Sound familiar?
Yes, because that's similar to core scheduling but inside the kernel
(let's call it "kernel core scheduling").

I think the benefit of this approach would be that it should be much
simpler to implement and less invasive than ASI, and it doesn't preclude
doing ASI later: ASI can be done in addition and provide an extra level
of mitigation in case some sensitive data is still accessed by KVM. Also it
would provide the critical sibling CPU synchronization mechanism that
we also need with ASI.

I did some prototyping to implement this kernel core scheduling a while
ago (and then got diverted to other stuff), but so far performance has
been abysmal, especially when doing strict synchronization between
sibling CPUs. I am planning to go back and do more investigation when I
have cycles but probably not that soon.


alex.

[4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@oracle.com
[6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@joelfernandes.org
[7] https://lore.kernel.org/lkml/20201109144425.270789-1-alexandre.chartre@oracle.com


> Background
> ==========
> Address Space Isolation is a comprehensive security mitigation for several types
> of speculative execution attacks. Even though the kernel already has several
> speculative execution vulnerability mitigations, some of them can be quite
> expensive if enabled fully e.g. to fully mitigate L1TF using the existing
> mechanisms requires doing an L1 cache flush on every single VM entry as well as
> disabling hyperthreading altogether. (Although core scheduling can provide some
> protection when hyperthreading is enabled, it is not sufficient by itself to
> protect against all leaks unless sibling hyperthread stunning is also performed
> on every VM exit.) ASI provides a much less expensive mitigation for such
> vulnerabilities while still providing a comparable level of protection.
> 
> There are a couple of basic insights/assumptions behind ASI:
> 
> 1. Most execution paths in the kernel (especially during virtual machine
> execution) access only memory that is not particularly sensitive even if it were
> to get leaked to the executing process/VM (setting aside for a moment what
> exactly should be considered sensitive or non-sensitive).
> 2. Even when executing speculatively, the CPU can generally only bring memory
> that is mapped in the current page tables into its various caches and internal
> buffers.
> 
> Given these, the idea of using ASI to thwart speculative attacks is that we can
> execute the kernel using a restricted set of page tables most of the time and
> switch to the full unrestricted kernel address space only when the kernel needs
> to access something that is not mapped in the restricted address space. And we
> keep track of when a switch to the full kernel address space is done, so that
> before returning back to the process/VM, we can switch back to the restricted
> address space. In the paths where the kernel is able to execute entirely while
> remaining in the restricted address space, we can skip other mitigations for
> speculative execution attacks (such as L1 cache / micro-arch buffer flushes,
> sibling hyperthread stunning etc.). Only in the cases where we do end up
> switching the page tables do we perform these more expensive mitigations. Assuming
> that happens relatively infrequently, the performance can be significantly
> better compared to performing these mitigations all the time.
> 
> Please note that although we do have a sibling hyperthread stunning
> implementation internally, which is fully integrated with KVM-ASI, it is not
> included in this RFC for the time being. The earlier upstream proposal for
> sibling stunning [6] could potentially be integrated into an upstream ASI
> implementation.
> 
> Basic concepts
> ==============
> Different types of restricted address spaces are represented by different ASI
> classes. For instance, KVM-ASI is an ASI class used during VM execution. KPTI
> would be another ASI class. An ASI instance (struct asi) represents a single
> restricted address space. There is a separate ASI instance for each untrusted
> context (e.g. a userspace process, a VM, or even a single VCPU etc.) Note that
> there can be multiple untrusted security contexts (and thus multiple restricted
> address spaces) within a single process e.g. in the case of VMs, the userspace
> process is a different security context than the guest VM, and in principle,
> even each VCPU could be considered a separate security context (That would be
> primarily useful for securing nested virtualization).
> 
> In this RFC, a process can have at most one ASI instance of each class, though
> this is not an inherent limitation and multiple instances of the same class
> should eventually be supported. (A process can still have ASI instances of
> different classes e.g. KVM-ASI and KPTI.) In fact, in principle, it is not even
> entirely necessary to tie an ASI instance to a process. That is just a
> simplification for the initial implementation.
> 
> An asi_enter operation switches into the restricted address space represented by
> the given ASI instance. An asi_exit operation switches to the full unrestricted
> kernel address space. Each ASI class can provide hooks to be executed during
> these operations, which can be used to perform speculative attack mitigations
> relevant to that class. For instance, the KVM-ASI hooks would perform a
> sibling-hyperthread-stun operation in the asi_exit hook, and L1-flush/MDS-clear
> and sibling-hyperthread-unstun operations in the asi_enter hook. On the other
> hand, the hooks for the KPTI class would be NO-OP, since the switching of the
> page tables is enough mitigation in that case.
> 
> If the kernel attempts to access memory that is not mapped in the currently
> active ASI instance, the page fault handler automatically performs an asi_exit
> operation. This means that except for a few critical pieces of memory, leaving
> something out of a restricted address space will result in only a performance
> hit, rather than a catastrophic failure. The kernel can also perform explicit
> asi_exit operations in some paths as needed.
> 
> Apart from the page fault handler, other exceptions and interrupts (even NMIs)
> do not automatically cause an asi_exit and could potentially be executed
> completely within a restricted address space if they don't end up accessing any
> sensitive piece of memory.
> 
> The mappings within a restricted address space are always a subset of the full
> kernel address space and each mapping is always the same as the corresponding
> mapping in the full kernel address space. This is necessary because we could
> potentially end up performing an asi_exit at any point.
> 
> Although this RFC only includes an implementation of the KVM-ASI class, a KPTI
> class could also be implemented on top of the same infrastructure. Furthermore,
> in the future we could also implement a KPTI-Next class that actually uses the
> ASI model for userspace processes i.e. mapping non-sensitive kernel memory in
> the restricted address space and trying to execute most syscalls/interrupts
> without switching to the full kernel address space, as opposed to the current
> KPTI which requires an address space switch on every kernel/user mode
> transition.
> 
> Memory classification
> =====================
> We divide memory into three categories.
> 
> 1. Sensitive memory
> This is memory that should never get leaked to any process or VM. Sensitive
> memory is only mapped in the unrestricted kernel page tables. By default, all
> memory is considered sensitive unless specifically categorized otherwise.
> 
> 2. Globally non-sensitive memory
> This is memory that does not present a substantial security threat even if it
> were to get leaked to any process or VM in the system. Globally non-sensitive
> memory is mapped in the restricted address spaces for all processes.
> 
> 3. Locally non-sensitive memory
> This is memory that does not present a substantial security threat if it were to
> get leaked to the currently running process or VM, but would present a security
> issue if it were to get leaked to any other process or VM in the system.
> Examples include userspace memory (or guest memory in the case of VMs) or kernel
> structures containing userspace/guest register context etc. Locally
> non-sensitive memory is mapped only in the restricted address space of a single
> process.
> 
> Various mechanisms are provided to annotate different types of memory (static,
> buddy allocator, slab, vmalloc etc.) as globally or locally non-sensitive. In
> addition, the ASI infrastructure takes care to ensure that different classes of
> memory do not share the same physical page. This includes separation of
> sensitive, globally non-sensitive and locally non-sensitive memory into
> different pages, as well as separation of locally non-sensitive memory for
> different processes into different pages.
> 
> What exactly should be considered non-sensitive (either globally or locally) is
> somewhat open-ended. Some things are clearly sensitive or non-sensitive, but
> many things also fall into a gray area, depending on how paranoid one wants to
> be. For this proof of concept, we have generally treated such things as
> non-sensitive, though that may not necessarily be the ideal classification in
> each case. Similarly, there is also a gray area between globally and locally
> non-sensitive classifications in some cases, and in those cases this RFC has
> mostly erred on the side of marking them as locally non-sensitive, even though
> many of those cases could likely be safely classified as globally non-sensitive.
> 
> Although this implementation includes fairly extensive support for marking most
> types of dynamically allocated memory as locally non-sensitive, it is possibly
> feasible, at least for KVM-ASI, to get away with a simpler implementation (such
> as [5]), if we are very selective about what memory we treat as locally
> non-sensitive (as opposed to globally non-sensitive). Nevertheless, the more
> general mechanism is included in this proof of concept as an illustration for
> what could be done if we really needed to treat any arbitrary kernel memory as
> locally non-sensitive.
> 
> It is also possible to have ASI classes that do not utilize the above described
> infrastructure and instead manage all the memory mappings inside the restricted
> address space on their own.
> 
> 
> References
> ==========
> [1] https://lore.kernel.org/lkml/1557758315-12667-1-git-send-email-alexandre.chartre@oracle.com
> [2] https://lore.kernel.org/lkml/1562855138-19507-1-git-send-email-alexandre.chartre@oracle.com
> [3] https://lore.kernel.org/lkml/1582734120-26757-1-git-send-email-alexandre.chartre@oracle.com
> [4] https://lore.kernel.org/lkml/20200504144939.11318-1-alexandre.chartre@oracle.com
> [5] https://lore.kernel.org/lkml/20190612170834.14855-1-mhillenb@amazon.de
> [6] https://lore.kernel.org/lkml/20200815031908.1015049-1-joel@joelfernandes.org
>
Thomas Gleixner March 16, 2022, 10:49 p.m. UTC | #3
Junaid,

On Tue, Feb 22 2022 at 21:21, Junaid Shahid wrote:
>
> The patches apply on top of Linux v5.16.

Why are you posting patches against some randomly chosen release?

Documentation/process/ is pretty clear about how this works. It's not
optional.

> These patches are also available via 
> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.

This is useful because?

If you want to provide patches in a usable form then please expose them
as git tree which can be pulled and not via the random tool of the day.

Thanks,

        tglx
Junaid Shahid March 17, 2022, 9:24 p.m. UTC | #4
On 3/16/22 15:49, Thomas Gleixner wrote:
> Junaid,
> 
> On Tue, Feb 22 2022 at 21:21, Junaid Shahid wrote:
>>
>> The patches apply on top of Linux v5.16.
> 
> Why are you posting patches against some randomly chosen release?
> 
> Documentation/process/ is pretty clear about how this works. It's not
> optional.

Sorry, I assumed that for an RFC, it may be acceptable to base on the last release version, but looks like I guessed wrong. I will base the next version of the RFC on the HEAD of the Linus tree.

> 
>> These patches are also available via
>> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.
> 
> This is useful because?
> 
> If you want to provide patches in a usable form then please expose them
> as git tree which can be pulled and not via the random tool of the day.

The patches are now available as the branch "asi-rfc-v1" in the git repo https://github.com/googleprodkernel/linux-kvm.git

Thanks,
Junaid

> 
> Thanks,
> 
>          tglx
> 
>
Junaid Shahid March 17, 2022, 11:25 p.m. UTC | #5
Hi Alex,

On 3/16/22 14:34, Alexandre Chartre wrote:
> 
> Hi Junaid,
> 
> On 2/23/22 06:21, Junaid Shahid wrote:
>> This patch series is a proof-of-concept RFC for an end-to-end implementation of
>> Address Space Isolation for KVM. It has similar goals and a somewhat similar
>> high-level design as the original ASI patches from Alexandre Chartre
>> ([1],[2],[3],[4]), but with a different underlying implementation. This also
>> includes several memory management changes to help with differentiating between
>> sensitive and non-sensitive memory and mapping of non-sensitive memory into the
>> ASI restricted address spaces.
>>
>> This RFC is intended as a demonstration of what a full ASI implementation for
>> KVM could look like, not necessarily as a direct proposal for what might
>> eventually be merged. In particular, these patches do not yet implement KPTI on
>> top of ASI, although the framework is generic enough to be able to support it.
>> Similarly, these patches do not include non-sensitive annotations for data
>> structures that did not get frequently accessed during execution of our test
>> workloads, but the framework is designed such that new non-sensitive memory
>> annotations can be added trivially.
>>
>> The patches apply on top of Linux v5.16. These patches are also available via
>> gerrit at https://linux-review.googlesource.com/q/topic:asi-rfc.
>>
> Sorry for the late answer, and thanks for investigating possible ASI
> implementations. I have to admit I put ASI on the back-burner for
> a while because I am more and more wondering if the complexity of
> ASI is worth the benefit, especially given challenges to effectively
> exploit flaws that ASI is expected to mitigate, in particular when VMs
> are running on dedicated cpu cores, or when core scheduling is used.
> So I have been looking at a more simplistic approach (see below, A
> Possible Alternative to ASI).
> 
> But first, your implementation confirms that KVM-ASI can be broken up
> into different parts: pagetable management, ASI core and sibling cpus
> synchronization.
> 
> Pagetable Management
> ====================
> For ASI, we need to build a pagetable with a subset of the kernel
> pagetable mappings. Your solution is interesting as it provides
> a broad solution and also works well with dynamic allocations (while
> my approach to copy mappings had several limitations). The drawback
> is the extent of your changes, which spread over all the mm code
> (while the simple solution to copy mappings can be done with a few
> self-contained independent functions).
> 
> ASI Core
> ========
> 
> KPTI
> ----
> Implementing KPTI with ASI is possible but this is not straightforward.
> This requires some special handling in particular in the assembly kernel
> entry/exit code for syscall, interrupt and exception (see ASI RFC v4 [4]
> as an example) because we are also switching privilege level in addition
> of switching the pagetable. So this might be something to consider early
> in your implementation to ensure it is effectively compatible with KPTI.

Yes, I will look in more detail into how to implement KPTI on top of this ASI implementation, but at least at a high level, it seems that it should work. Of course, the devil is always in the details :)

> 
> Going beyond KPTI (with a KPTI-next) and trying to execute most
> syscalls/interrupts without switching to the full kernel address space
> is more challenging, because it would require much more kernel mapping
> in the user pagetable, and this would basically defeat the purpose of
> KPTI. You can refer to discussions about the RFC to defer CR3 switch
> to C code [7] which was an attempt to just reach the kernel entry C
> code with a KPTI pagetable.

In principle, the ASI restricted address space would not contain any sensitive data, so having more mappings should be ok as long as they really are non-sensitive. Of course, it is possible that we mistakenly mark some data as non-sensitive when it is in fact sensitive in some way. In that sense, a strict KPTI is certainly a little more secure than the KPTI-Next that I mentioned, but KPTI-Next would also have lower performance overhead compared to the strict KPTI.

> 
> Interrupts/Exceptions
> ---------------------
> As long as interrupts/exceptions are not expected to be processed with
> ASI, it is probably better to explicitly exit ASI before processing an
> interrupt/exception, otherwise you will have an extra overhead on each
> interrupt/exception to take a page fault and then exit ASI.

I agree that for those interrupts/exceptions that will need to access sensitive data, it is better to do an explicit ASI Exit at the start. But it is probably possible for many interrupts to be handled without needing to access sensitive data, in which case, it would be better to avoid the ASI Exit.

> 
> This is particularly true if you want to have KPTI use ASI, and
> in that case the ASI exit will need to be done early in the interrupt
> and exception assembly entry code.
> 
> ASI Hooks
> ---------
> ASI hooks are certainly a good idea to perform specific actions on ASI
> enter or exit. However, I am not sure they are appropriate places for cpus
> stunning with KVM-ASI. That's because cpus stunning doesn't need to be
> done precisely when entering and exiting ASI, and it probably shouldn't be
> done there: it should be done right before VMEnter and right after VMExit
> (see below).
> 

I believe that performing sibling CPU stunning right after VM Exit will negate most of the performance advantage of ASI. I think that it is feasible to do the stunning on ASI Exit. Please see below for how we handle the problem that you have mentioned.


> Sibling CPUs Synchronization
> ============================
> KVM-ASI requires the synchronization of sibling CPUs from the same CPU
> core so that when a VM is running then sibling CPUs are running with the
> ASI associated with this VM (or an ASI compatible with the VM, depending
> on how ASI is defined). That way the VM can only spy on data from ASI
> and won't be able to access any sensitive data.
> 
> So, right before entering a VM, KVM should ensure that sibling CPUs are
> using ASI. If a sibling CPU is not using ASI then KVM can either wait for
> that sibling to run ASI, or force it to use ASI (or to become idle).
> This behavior should be enforced as long as any sibling is running the
> VM. When all siblings are not running the VM then other siblings can run
> any code (using ASI or not).
> 
> It would be interesting to see the code you use to achieve this, because
> I don't get how this is achieved from the description of your sibling
> hyperthread stun and unstun mechanism.
> 
> Note that this synchronization is critical for ASI to work, in particular
> when entering the VM, we need to be absolutely sure that sibling CPUs are
> effectively using ASI. The core scheduling sibling stunning code you
> referenced [6] uses a mechanism which is fine for userspace synchronization
> (the delivery of the IPI forces the sibling to immediately enter the kernel)
> but this won't work for ASI as the delivery of the IPI won't guarantee that
> the sibling has entered ASI yet. I did some experiments that show that data
> will leak if siblings are not perfectly synchronized.

I agree that it is not secure to run one sibling in the unrestricted kernel address space while the other sibling is running in an ASI restricted address space, without doing a cache flush before re-entering the VM. However, I think that avoiding this situation does not require doing a sibling stun operation immediately after VM Exit. The way we avoid it is as follows.

First, we always use ASI in conjunction with core scheduling. This means that if HT0 is running a VCPU thread, then HT1 will be running either a VCPU thread of the same VM or the Idle thread. If it is running a VCPU thread, then if/when that thread takes a VM Exit, it will also be running in the same ASI restricted address space. For the idle thread, we have created another ASI Class, called Idle-ASI, which maps only globally non-sensitive kernel memory. The idle loop enters this ASI address space.

This means that when HT0 does a VM Exit, HT1 will either be running the guest code of a VCPU of the same VM, or it will be running kernel code in either a KVM-ASI or the Idle-ASI address space. (If HT1 is already running in the full kernel address space, that would imply that it had previously done an ASI Exit, which would have triggered a stun_sibling, which would have already caused HT0 to exit the VM and wait in the kernel).

If HT1 now does an ASI Exit, that will trigger the stun_sibling() operation in its pre_asi_exit() handler, which will set the state of the core/HT0 to Stunned (and possibly send an IPI too, though that will be ignored if HT0 was already in kernel mode). Now when HT0 tries to re-enter the VM, since its state is set to Stunned, it will just wait in a loop until HT1 does an unstun_sibling() operation, which it will do in its post_asi_enter handler the next time it does an ASI Enter (which would be either just before VM Enter if it was KVM-ASI, or in the next iteration of the idle loop if it was Idle-ASI). In either case, HT1's post_asi_enter() handler would also do a flush_sensitive_cpu_state operation before the unstun_sibling(), so when HT0 gets out of its wait-loop and does a VM Enter, there will not be any sensitive state left.

One thing that probably was not clear from the patch, is that the stun state check and wait-loop is still always executed before VM Enter, even if no ASI Exit happened in that execution.

> 
> A Possible Alternative to ASI?
> =============================
> ASI prevents access to sensitive data by unmapping them. On the other
> hand, the KVM code somewhat already identifies access to sensitive data
> as part of the L1TF/MDS mitigation, and when KVM is about to access
> sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
> flushed before VMEnter).
> 
> With KVM knowing when it accesses sensitive data, I think we can provide
> the same mitigation as ASI by simply allowing KVM code which doesn't
> access sensitive data to be run concurrently with a VM. This can be done
> by tagging the kernel thread when it enters KVM code which doesn't
> access sensitive data, and untagging the thread right before it accesses
> sensitive data. And when KVM is about to do a VMEnter then we synchronize
> sibling CPUs so that they run threads with the same tag. Sounds familiar?
> Yes, because that's similar to core scheduling but inside the kernel
> (let's call it "kernel core scheduling").
> 
> I think the benefit of this approach would be that it should be much
> simpler to implement and less invasive than ASI, and it doesn't preclude
> to later do ASI: ASI can be done in addition and provide an extra level
> of mitigation in case some sensitive data is still accessed by KVM. Also it
> would provide the critical sibling CPU synchronization mechanism that
> we also need with ASI.
> 
> I did some prototyping to implement this kernel core scheduling a while
> ago (and then got diverted onto other stuff) but so far performances have
> been abyssal especially when doing a strict synchronization between
> sibling CPUs. I am planning to go back and do more investigations when I
> have cycles but probably not that soon.
> 

This also seems like an interesting approach. It does have some different trade-offs compared to ASI. First, there is the trade-off between a blacklist vs. whitelist approach. Secondly, ASI has a more structured approach based on the sensitivity of the data itself, instead of having to audit every new code path to verify whether or not it can potentially access any sensitive data. On the other hand, as you point out, this approach is much simpler than ASI, which is certainly a plus.

Thanks,
Junaid
Alexandre Chartre March 22, 2022, 9:46 a.m. UTC | #6
On 3/18/22 00:25, Junaid Shahid wrote:
>> ASI Core
>> ========
>>
>> KPTI
>> ----
>> Implementing KPTI with ASI is possible but this is not straightforward.
>> This requires some special handling in particular in the assembly kernel
>> entry/exit code for syscall, interrupt and exception (see ASI RFC v4 [4]
>> as an example) because we are also switching privilege level in addition
>> to switching the pagetable. So this might be something to consider early
>> in your implementation to ensure it is effectively compatible with KPTI.
> 
> Yes, I will look in more detail into how to implement KPTI on top of
> this ASI implementation, but at least at a high level, it seems that
> it should work. Of course, the devil is always in the details :)
> 
>>
>> Going beyond KPTI (with a KPTI-next) and trying to execute most
>> syscalls/interrupts without switching to the full kernel address space
>> is more challenging, because it would require much more kernel mapping
>> in the user pagetable, and this would basically defeat the purpose of
>> KPTI. You can refer to discussions about the RFC to defer CR3 switch
>> to C code [7] which was an attempt to just reach the kernel entry C
>> code with a KPTI pagetable.
> 
> In principle, the ASI restricted address space would not contain any
> sensitive data, so having more mappings should be ok as long as they
> really are non-sensitive. Of course, it is possible that we may
> mistakenly think that some data is not sensitive and mark it as such,
> but in reality it really was sensitive in some way. In that sense, a
> strict KPTI is certainly a little more secure than the KPTI-Next that
> I mentioned, but KPTI-Next would also have lower performance overhead
> compared to the strict KPTI.
> 

Mappings are precisely the issue for KPTI-next. The RFC I submitted shows
that going beyond KPTI might require mapping data which could be deemed
sensitive. Also, there are extra complications that make it difficult to reach
C code with a KPTI page-table. This was discussed in v2 of the "Defer CR3
switch to C code" RFC:
https://lore.kernel.org/all/20201116144757.1920077-1-alexandre.chartre@oracle.com/


>>
>> Interrupts/Exceptions
>> ---------------------
>> As long as interrupts/exceptions are not expected to be processed with
>> ASI, it is probably better to explicitly exit ASI before processing an
>> interrupt/exception, otherwise you will have an extra overhead on each
>> interrupt/exception to take a page fault and then exit ASI.
> 
> I agree that for those interrupts/exceptions that will need to access
> sensitive data, it is better to do an explicit ASI Exit at the start.
> But it is probably possible for many interrupts to be handled without
> needing to access sensitive data, in which case, it would be better
> to avoid the ASI Exit.
> 
>>
>> This is particularly true if you want to have KPTI use ASI, and
>> in that case the ASI exit will need to be done early in the interrupt
>> and exception assembly entry code.
>>
>> ASI Hooks
>> ---------
>> ASI hooks are certainly a good idea to perform specific actions on ASI
>> enter or exit. However, I am not sure they are appropriate places for cpus
>> stunning with KVM-ASI. That's because cpus stunning doesn't need to be
>> done precisely when entering and exiting ASI, and it probably shouldn't be
>> done there: it should be done right before VMEnter and right after VMExit
>> (see below).
>>
> 
> I believe that performing sibling CPU stunning right after VM Exit
> will negate most of the performance advantage of ASI. I think that it
> is feasible to do the stunning on ASI Exit. Please see below for how
> we handle the problem that you have mentioned.
> 

Right, I was confused about what exactly you meant by cpu stun/unstun, but
I think it's now clearer with your explanation below.


> 
>> Sibling CPUs Synchronization
>> ============================
>> KVM-ASI requires the synchronization of sibling CPUs from the same CPU
>> core so that when a VM is running then sibling CPUs are running with the
>> ASI associated with this VM (or an ASI compatible with the VM, depending
>> on how ASI is defined). That way the VM can only spy on data from ASI
>> and won't be able to access any sensitive data.
>>
>> So, right before entering a VM, KVM should ensure that sibling CPUs are
>> using ASI. If a sibling CPU is not using ASI then KVM can either wait for
>> that sibling to run ASI, or force it to use ASI (or to become idle).
>> This behavior should be enforced as long as any sibling is running the
>> VM. When all siblings are not running the VM then other siblings can run
>> any code (using ASI or not).
>>
>> It would be interesting to see the code you use to achieve this, because
>> I don't get how this is achieved from the description of your sibling
>> hyperthread stun and unstun mechanism.
>>
>> Note that this synchronization is critical for ASI to work, in particular
>> when entering the VM, we need to be absolutely sure that sibling CPUs are
>> effectively using ASI. The core scheduling sibling stunning code you
>> referenced [6] uses a mechanism which is fine for userspace synchronization
>> (the delivery of the IPI forces the sibling to immediately enter the kernel)
>> but this won't work for ASI as the delivery of the IPI won't guarantee that
>> the sibling has entered ASI yet. I did some experiments that show that data
>> will leak if siblings are not perfectly synchronized.
> 
> I agree that it is not secure to run one sibling in the unrestricted
> kernel address space while the other sibling is running in an ASI
> restricted address space, without doing a cache flush before
> re-entering the VM. However, I think that avoiding this situation
> does not require doing a sibling stun operation immediately after VM
> Exit. The way we avoid it is as follows.
> 
> First, we always use ASI in conjunction with core scheduling. This
> means that if HT0 is running a VCPU thread, then HT1 will be running
> either a VCPU thread of the same VM or the Idle thread. If it is
> running a VCPU thread, then if/when that thread takes a VM Exit, it
> will also be running in the same ASI restricted address space. For
> the idle thread, we have created another ASI Class, called Idle-ASI,
> which maps only globally non-sensitive kernel memory. The idle loop
> enters this ASI address space.
> 
> This means that when HT0 does a VM Exit, HT1 will either be running
> the guest code of a VCPU of the same VM, or it will be running kernel
> code in either a KVM-ASI or the Idle-ASI address space. (If HT1 is
> already running in the full kernel address space, that would imply
> that it had previously done an ASI Exit, which would have triggered a
> stun_sibling, which would have already caused HT0 to exit the VM and
> wait in the kernel).

Note that using core scheduling (or not) is a detail; what is important
is whether HTs are running with ASI or not. Running core scheduling just
improves the chances of having all siblings run ASI at the same time,
and so improves ASI performance.


> If HT1 now does an ASI Exit, that will trigger the stun_sibling()
> operation in its pre_asi_exit() handler, which will set the state of
> the core/HT0 to Stunned (and possibly send an IPI too, though that
> will be ignored if HT0 was already in kernel mode). Now when HT0
> tries to re-enter the VM, since its state is set to Stunned, it will
> just wait in a loop until HT1 does an unstun_sibling() operation,
> which it will do in its post_asi_enter handler the next time it does
> an ASI Enter (which would be either just before VM Enter if it was
> KVM-ASI, or in the next iteration of the idle loop if it was
> Idle-ASI). In either case, HT1's post_asi_enter() handler would also
> do a flush_sensitive_cpu_state operation before the unstun_sibling(),
> so when HT0 gets out of its wait-loop and does a VM Enter, there will
> not be any sensitive state left.
> 
> One thing that probably was not clear from the patch, is that the
> stun state check and wait-loop is still always executed before VM
> Enter, even if no ASI Exit happened in that execution.
> 

So if I understand correctly, you have following sequence:

0 - Initially state is set to "stunned" for all cpus (i.e. a cpu should
     wait before VMEnter)

1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling can
     do VMEnter)

2 - Before VMEnter : wait while my state is "stunned"

3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling should
     wait before VMEnter)

I have tried this kind of implementation, and the problem is with step 2
(wait while my state is "stunned"); how do you wait exactly? You can't
just do an active wait, otherwise you have all kinds of problems (depending
on whether you have interrupts enabled or not), especially as you don't know
how long you have to wait (this depends on what the other cpu is doing).

That's why I have been dissociating ASI and cpu stunning (and eventually
moving to only doing kernel core scheduling). Basically I replaced step 2 by
a call to the scheduler to select threads using ASI on all siblings (or
run something else if there's higher priority threads to run) which means
enabling kernel core scheduling at this point.

>>
>> A Possible Alternative to ASI?
>> =============================
>> ASI prevents access to sensitive data by unmapping them. On the other
>> hand, the KVM code somewhat already identifies access to sensitive data
>> as part of the L1TF/MDS mitigation, and when KVM is about to access
>> sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
>> flushed before VMEnter).
>>
>> With KVM knowing when it accesses sensitive data, I think we can provide
>> the same mitigation as ASI by simply allowing KVM code which doesn't
>> access sensitive data to be run concurrently with a VM. This can be done
>> by tagging the kernel thread when it enters KVM code which doesn't
>> access sensitive data, and untagging the thread right before it accesses
>> sensitive data. And when KVM is about to do a VMEnter then we synchronize
>> sibling CPUs so that they run threads with the same tag. Sounds familiar?
>> Yes, because that's similar to core scheduling but inside the kernel
>> (let's call it "kernel core scheduling").
>>
>> I think the benefit of this approach would be that it should be much
>> simpler to implement and less invasive than ASI, and it doesn't preclude
>> to later do ASI: ASI can be done in addition and provide an extra level
>> of mitigation in case some sensitive data is still accessed by KVM. Also it
>> would provide the critical sibling CPU synchronization mechanism that
>> we also need with ASI.
>>
>> I did some prototyping to implement this kernel core scheduling a while
>> ago (and then got diverted onto other stuff) but so far performances have
>> been abyssal especially when doing a strict synchronization between
>> sibling CPUs. I am planning to go back and do more investigations when I
>> have cycles but probably not that soon.
>>
> 
> This also seems like an interesting approach. It does have some
> different trade-offs compared to ASI. First, there is the trade-off
> between a blacklist vs. whitelist approach. Secondly, ASI has a more
> structured approach based on the sensitivity of the data itself,
> instead of having to audit every new code path to verify whether or
> not it can potentially access any sensitive data. On the other hand,
> as you point out, this approach is much simpler than ASI, which is
> certainly a plus.

I think the main benefit is that it provides a mechanism for running
specific kernel threads together on sibling cpus independently of ASI.
So it will be easier to implement (you don't need ASI) and to test.

Then, once this mechanism has proven to work (and to be efficient),
you can have KVM ASI use it.

alex.
Junaid Shahid March 23, 2022, 7:35 p.m. UTC | #7
On 3/22/22 02:46, Alexandre Chartre wrote:
> 
> On 3/18/22 00:25, Junaid Shahid wrote:
>>
>> I agree that it is not secure to run one sibling in the unrestricted
>> kernel address space while the other sibling is running in an ASI
>> restricted address space, without doing a cache flush before
>> re-entering the VM. However, I think that avoiding this situation
>> does not require doing a sibling stun operation immediately after VM
>> Exit. The way we avoid it is as follows.
>>
>> First, we always use ASI in conjunction with core scheduling. This
>> means that if HT0 is running a VCPU thread, then HT1 will be running
>> either a VCPU thread of the same VM or the Idle thread. If it is
>> running a VCPU thread, then if/when that thread takes a VM Exit, it
>> will also be running in the same ASI restricted address space. For
>> the idle thread, we have created another ASI Class, called Idle-ASI,
>> which maps only globally non-sensitive kernel memory. The idle loop
>> enters this ASI address space.
>>
>> This means that when HT0 does a VM Exit, HT1 will either be running
>> the guest code of a VCPU of the same VM, or it will be running kernel
>> code in either a KVM-ASI or the Idle-ASI address space. (If HT1 is
>> already running in the full kernel address space, that would imply
>> that it had previously done an ASI Exit, which would have triggered a
>> stun_sibling, which would have already caused HT0 to exit the VM and
>> wait in the kernel).
> 
> Note that using core scheduling (or not) is a detail; what is important
> is whether the HTs are running with ASI or not. Running core scheduling
> will just improve the chances that all siblings run ASI at the same time,
> and so improve ASI performance.
> 
> 
>> If HT1 now does an ASI Exit, that will trigger the stun_sibling()
>> operation in its pre_asi_exit() handler, which will set the state of
>> the core/HT0 to Stunned (and possibly send an IPI too, though that
>> will be ignored if HT0 was already in kernel mode). Now when HT0
>> tries to re-enter the VM, since its state is set to Stunned, it will
>> just wait in a loop until HT1 does an unstun_sibling() operation,
>> which it will do in its post_asi_enter handler the next time it does
>> an ASI Enter (which would be either just before VM Enter if it was
>> KVM-ASI, or in the next iteration of the idle loop if it was
>> Idle-ASI). In either case, HT1's post_asi_enter() handler would also
>> do a flush_sensitive_cpu_state operation before the unstun_sibling(),
>> so when HT0 gets out of its wait-loop and does a VM Enter, there will
>> not be any sensitive state left.
>>
>> One thing that probably was not clear from the patch, is that the
>> stun state check and wait-loop is still always executed before VM
>> Enter, even if no ASI Exit happened in that execution.
>>
> 
> So if I understand correctly, you have following sequence:
> 
> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu should
>      wait before VMEnter)
> 
> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling can
>      do VMEnter)
> 
> 2 - Before VMEnter : wait while my state is "stunned"
> 
> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling should
>      wait before VMEnter)
> 
> I have tried this kind of implementation, and the problem is with step 2
> (wait while my state is "stunned"); how do you wait exactly? You can't
> just do an active wait, otherwise you have all kinds of problems (depending
> on whether you have interrupts enabled or not), especially as you don't know
> how long you have to wait (this depends on what the other cpu is doing).

In our stunning implementation, we do an active wait with interrupts enabled and with a need_resched() check to decide when to bail out to the scheduler (plus we also make sure that we re-enter ASI at the end of the wait in case some interrupt exited ASI). What kind of problems have you run into with an active wait, besides wasted CPU cycles?

In any case, the specific stunning mechanism is orthogonal to ASI. This implementation of ASI can be integrated with different stunning implementations. The "kernel core scheduling" that you proposed is also an alternative to stunning and could be similarly integrated with ASI.

> 
> That's why I have been dissociating ASI and cpu stunning (and eventually
> move to only do kernel core scheduling). Basically I replaced step 2 by
> a call to the scheduler to select threads using ASI on all siblings (or
> run something else if there's higher priority threads to run) which means
> enabling kernel core scheduling at this point.
> 
>>>
>>> A Possible Alternative to ASI?
>>> =============================
>>> ASI prevents access to sensitive data by unmapping them. On the other
>>> hand, the KVM code somewhat already identifies access to sensitive data
>>> as part of the L1TF/MDS mitigation, and when KVM is about to access
>>> sensitive data then it sets l1tf_flush_l1d to true (so that L1D gets
>>> flushed before VMEnter).
>>>
>>> With KVM knowing when it accesses sensitive data, I think we can provide
>>> the same mitigation as ASI by simply allowing KVM code which doesn't
>>> access sensitive data to be run concurrently with a VM. This can be done
>>> by tagging the kernel thread when it enters KVM code which doesn't
>>> access sensitive data, and untagging the thread right before it accesses
>>> sensitive data. And when KVM is about to do a VMEnter then we synchronize
>>> sibling CPUs so that they run threads with the same tag. Sounds familiar?
>>> Yes, because that's similar to core scheduling but inside the kernel
>>> (let's call it "kernel core scheduling").
>>>
>>> I think the benefit of this approach would be that it should be much
>>> simpler to implement and less invasive than ASI, and it doesn't preclude
>>> doing ASI later: ASI can be done in addition and provide an extra level
>>> of mitigation in case some sensitive data is still accessed by KVM. Also it
>>> would provide the critical sibling CPU synchronization mechanism that
>>> we also need with ASI.
>>>
>>> I did some prototyping to implement this kernel core scheduling a while
>>> ago (and then got diverted to other stuff), but so far performance has
>>> been abysmal, especially when doing strict synchronization between
>>> sibling CPUs. I am planning to go back and do more investigation when I
>>> have cycles, but probably not that soon.
>>>
>>
>> This also seems like an interesting approach. It does have some
>> different trade-offs compared to ASI. First, there is the trade-off
>> between a blacklist vs. whitelist approach. Secondly, ASI has a more
>> structured approach based on the sensitivity of the data itself,
>> instead of having to audit every new code path to verify whether or
>> not it can potentially access any sensitive data. On the other hand,
>> as you point out, this approach is much simpler than ASI, which is
>> certainly a plus.
> 
> I think the main benefit is that it provides a mechanism for running
> specific kernel threads together on sibling cpus independently of ASI.
> So it will be easier to implement (you don't need ASI) and to test.
> 

It would be interesting to see the performance characteristics of this approach compared to stunning. I think it would really depend on how long we typically end up staying in the full kernel address space while running VCPUs.

Note that stunning can also be implemented independently of ASI by integrating it with the same conditional L1TF mitigation mechanism (l1tf_flush_l1d) that currently exists in KVM. The way I see it, this kernel core scheduling is an alternative to stunning, regardless of whether we integrate it with ASI or with the existing conditional mitigation mechanism.

> Then, once this mechanism has proven to work (and to be efficient),
> you can have KVM ASI use it.
> 

Yes, if this mechanism seems to work better than stunning, then we could certainly integrate this with ASI. Though it is possible that we may end up needing ASI to get to the "efficient" part.

Thanks,
Junaid
Alexandre Chartre April 8, 2022, 8:52 a.m. UTC | #8
On 3/23/22 20:35, Junaid Shahid wrote:
> On 3/22/22 02:46, Alexandre Chartre wrote:
>> 
>> On 3/18/22 00:25, Junaid Shahid wrote:
>>> 
>>> I agree that it is not secure to run one sibling in the
>>> unrestricted kernel address space while the other sibling is
>>> running in an ASI restricted address space, without doing a cache
>>> flush before re-entering the VM. However, I think that avoiding
>>> this situation does not require doing a sibling stun operation
>>> immediately after VM Exit. The way we avoid it is as follows.
>>> 
>>> First, we always use ASI in conjunction with core scheduling.
>>> This means that if HT0 is running a VCPU thread, then HT1 will be
>>> running either a VCPU thread of the same VM or the Idle thread.
>>> If it is running a VCPU thread, then if/when that thread takes a
>>> VM Exit, it will also be running in the same ASI restricted
>>> address space. For the idle thread, we have created another ASI
>>> Class, called Idle-ASI, which maps only globally non-sensitive
>>> kernel memory. The idle loop enters this ASI address space.
>>> 
>>> This means that when HT0 does a VM Exit, HT1 will either be
>>> running the guest code of a VCPU of the same VM, or it will be
>>> running kernel code in either a KVM-ASI or the Idle-ASI address
>>> space. (If HT1 is already running in the full kernel address
>>> space, that would imply that it had previously done an ASI Exit,
>>> which would have triggered a stun_sibling, which would have
>>> already caused HT0 to exit the VM and wait in the kernel).
>> 
>> Note that using core scheduling (or not) is a detail; what is
>> important is whether the HTs are running with ASI or not. Running core
>> scheduling will just improve the chances that all siblings run ASI
>> at the same time, and so improve ASI performance.
>> 
>> 
>>> If HT1 now does an ASI Exit, that will trigger the
>>> stun_sibling() operation in its pre_asi_exit() handler, which
>>> will set the state of the core/HT0 to Stunned (and possibly send
>>> an IPI too, though that will be ignored if HT0 was already in
>>> kernel mode). Now when HT0 tries to re-enter the VM, since its
>>> state is set to Stunned, it will just wait in a loop until HT1
>>> does an unstun_sibling() operation, which it will do in its
>>> post_asi_enter handler the next time it does an ASI Enter (which
>>> would be either just before VM Enter if it was KVM-ASI, or in the
>>> next iteration of the idle loop if it was Idle-ASI). In either
>>> case, HT1's post_asi_enter() handler would also do a
>>> flush_sensitive_cpu_state operation before the unstun_sibling(), 
>>> so when HT0 gets out of its wait-loop and does a VM Enter, there
>>> will not be any sensitive state left.
>>> 
>>> One thing that probably was not clear from the patch, is that
>>> the stun state check and wait-loop is still always executed
>>> before VM Enter, even if no ASI Exit happened in that execution.
>>> 
>> 
>> So if I understand correctly, you have following sequence:
>> 
>> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu
>> should wait before VMEnter)
>> 
>> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling
>> can do VMEnter)
>> 
>> 2 - Before VMEnter : wait while my state is "stunned"
>> 
>> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling
>> should wait before VMEnter)
>> 
>> I have tried this kind of implementation, and the problem is with
>> step 2 (wait while my state is "stunned"); how do you wait exactly?
>> You can't just do an active wait, otherwise you have all kinds of
>> problems (depending on whether you have interrupts enabled or not),
>> especially as you don't know how long you have to wait (this
>> depends on what the other cpu is doing).
> 
> In our stunning implementation, we do an active wait with interrupts 
> enabled and with a need_resched() check to decide when to bail out
> to the scheduler (plus we also make sure that we re-enter ASI at the
> end of the wait in case some interrupt exited ASI). What kind of
> problems have you run into with an active wait, besides wasted CPU
> cycles?

If you wait with interrupts enabled, then there is a window after the
wait and before interrupts get disabled where a cpu can get an interrupt
and exit ASI while the sibling is entering the VM. Also, after a cpu has
passed the wait and has disabled interrupts, it can't be notified if the
sibling has exited ASI:

T+01 - cpu A and B enter ASI - interrupts are enabled
T+02 - cpu A and B pass the wait because both are using ASI - interrupts are enabled
T+03 - cpu A gets an interrupt
T+04 - cpu B disables interrupts
T+05 - cpu A exit ASI and process interrupts
T+06 - cpu B enters VM  => cpu B runs VM while cpu A is not using ASI
T+07 - cpu B exits VM
T+08 - cpu B exits ASI
T+09 - cpu A returns from interrupt
T+10 - cpu A disables interrupts and enters VM => cpu A runs VM while cpu A is not using ASI


> In any case, the specific stunning mechanism is orthogonal to ASI.
> This implementation of ASI can be integrated with different stunning
> implementations. The "kernel core scheduling" that you proposed is
> also an alternative to stunning and could be similarly integrated
> with ASI.

Yes, but for ASI to be relevant with KVM to prevent data leaks, you need
a fully functional and reliable stunning mechanism, otherwise ASI is
useless. That's why I think it is better to first focus on having an
effective stunning mechanism and then implement ASI.


alex.
junaid_shahid@hotmail.com April 11, 2022, 3:26 a.m. UTC | #9
Hi Alex,

> On 3/23/22 20:35, Junaid Shahid wrote:
>> On 3/22/22 02:46, Alexandre Chartre wrote:
>>> 
>>> So if I understand correctly, you have following sequence:
>>> 
>>> 0 - Initially state is set to "stunned" for all cpus (i.e. a cpu
>>> should wait before VMEnter)
>>> 
>>> 1 - After ASI Enter: Set sibling state to "unstunned" (i.e. sibling
>>> can do VMEnter)
>>> 
>>> 2 - Before VMEnter : wait while my state is "stunned"
>>> 
>>> 3 - Before ASI Exit : Set sibling state to "stunned" (i.e. sibling
>>> should wait before VMEnter)
>>> 
>>> I have tried this kind of implementation, and the problem is with
>>> step 2 (wait while my state is "stunned"); how do you wait exactly?
>>> You can't just do an active wait, otherwise you have all kinds of
>>> problems (depending on whether you have interrupts enabled or not),
>>> especially as you don't know how long you have to wait (this
>>> depends on what the other cpu is doing).
>> 
>> In our stunning implementation, we do an active wait with interrupts 
>> enabled and with a need_resched() check to decide when to bail out
>> to the scheduler (plus we also make sure that we re-enter ASI at the
>> end of the wait in case some interrupt exited ASI). What kind of
>> problems have you run into with an active wait, besides wasted CPU
>> cycles?
>
> If you wait with interrupts enabled, then there is a window after the
> wait and before interrupts get disabled where a cpu can get an interrupt
> and exit ASI while the sibling is entering the VM.

We actually do another check after disabling interrupts, and if it turns out
that we need to wait again, we just go back to the wait loop after re-enabling
interrupts. But, irrespective of that,

> Also, after a cpu has passed
> the wait and has disabled interrupts, it can't be notified if the sibling
> has exited ASI:

I don't think that this is actually the case. Yes, the IPI from the sibling
will be blocked while the host kernel has disabled interrupts. However, when
the host kernel executes a VMENTER, if there is a pending IPI, the VM will
immediately exit back to the host even before executing any guest code. So
AFAICT there is not going to be any data leak in the scenario that you
mentioned. Basically, the "cpu B runs VM" in step T+06 won't actually happen.

> 
> T+01 - cpu A and B enter ASI - interrupts are enabled
> T+02 - cpu A and B pass the wait because both are using ASI - interrupts are enabled
> T+03 - cpu A gets an interrupt
> T+04 - cpu B disables interrupts
> T+05 - cpu A exit ASI and process interrupts
> T+06 - cpu B enters VM  => cpu B runs VM while cpu A is not using ASI
> T+07 - cpu B exits VM
> T+08 - cpu B exits ASI
> T+09 - cpu A returns from interrupt
> T+10 - cpu A disables interrupts and enters VM => cpu A runs VM while cpu A is not using ASI

The "cpu A runs VM while cpu A is not using ASI" will also not happen, because
cpu A will re-enter ASI after disabling interrupts and before entering the VM.

> 
>> In any case, the specific stunning mechanism is orthogonal to ASI.
>> This implementation of ASI can be integrated with different stunning
>> implementations. The "kernel core scheduling" that you proposed is
>> also an alternative to stunning and could be similarly integrated
>> with ASI.
>
> Yes, but for ASI to be relevant with KVM to prevent data leaks, you need
> a fully functional and reliable stunning mechanism, otherwise ASI is
> useless. That's why I think it is better to first focus on having an
> effective stunning mechanism and then implement ASI.
> 

Sure, that makes sense. The only caveat is that, at least in our testing, the
overhead of stunning alone without ASI seemed too high. But I can try to see
if we might be able to post our stunning implementation with the next version
of the RFC.

Thanks,
Junaid

PS: I am away from the office for a few weeks, so email replies may be delayed
until next month.