[v18,117/121] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU

Message ID	3dcf883d75158bf5c1a6e62e1b60f3882022f603.1705965635.git.isaku.yamahata@intel.com (mailing list archive)
State	New, archived
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.55.52.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BF54C82D77; Mon, 22 Jan 2024 23:56:19 +0000 (UTC) From: isaku.yamahata@intel.com To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: isaku.yamahata@intel.com, isaku.yamahata@gmail.com, Paolo Bonzini <pbonzini@redhat.com>, erdemaktas@google.com, Sean Christopherson <seanjc@google.com>, Sagi Shahar <sagis@google.com>, Kai Huang <kai.huang@intel.com>, chen.bo@intel.com, hang.yuan@intel.com, tina.zhang@intel.com Subject: [PATCH v18 117/121] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU Date: Mon, 22 Jan 2024 15:54:33 -0800 Message-Id: <3dcf883d75158bf5c1a6e62e1b60f3882022f603.1705965635.git.isaku.yamahata@intel.com> In-Reply-To: <cover.1705965634.git.isaku.yamahata@intel.com> References: <cover.1705965634.git.isaku.yamahata@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	KVM TDX basic feature support

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 851e99174762..63a78bd41b16 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -16,4 +16,5 @@ KVM for x86 systems
    msr
    nested-vmx
    running-nested-guests
+   tdx-tdp-mmu
    timekeeping
diff --git a/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst b/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
new file mode 100644
index 000000000000..49d103720272
--- /dev/null
+++ b/Documentation/virt/kvm/x86/tdx-tdp-mmu.rst
@@ -0,0 +1,443 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Design of TDP MMU for TDX support
+=================================
+This document describes a high-level design for TDX support in the TDP MMU of
+x86 KVM.
+
+In this document, we use "TD" or "guest TD" to differentiate it from the current
+"VM" (Virtual Machine), which is supported by KVM today.
+
+
+Background of TDX
+=================
+TD private memory is designed to hold TD private content, encrypted by the CPU
+using the TD ephemeral key. An encryption engine holds a table of encryption
+keys, and an encryption key is selected for each memory transaction based on a
+Host Key Identifier (HKID). By design, the host VMM does not have access to the
+encryption keys.
+
+In the first generation of MKTME, HKID is "stolen" from the physical address by
+allocating a configurable number of bits from the top of the physical address.
+The HKID space is partitioned into shared HKIDs for legacy MKTME accesses and
+private HKIDs for SEAM-mode-only accesses. We use 0 for the shared HKID on the
+host so that MKTME can be opaque or bypassed on the host.
+
+During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
+as either shared or private, based on the value of a new SHARED bit in the Guest
+Physical Address (GPA). The CPU translates shared GPAs using the usual VMX EPT
+(Extended Page Table) or "Shared EPT" (in this document), which resides in the
+host VMM memory.
+The Shared EPT is directly managed by the host VMM, the same
+as with the current VMX. Since guest TDs usually require I/O, and the data
+exchange needs to be done via shared memory, KVM needs to use the current
+EPT functionality even for TDs.
+
+The CPU translates private GPAs using a separate Secure EPT. The Secure EPT
+pages are encrypted and integrity-protected with the TD's ephemeral private key.
+Secure EPT can be managed _indirectly_ by the host VMM, using the TDX interface
+functions (SEAMCALLs), and thus conceptually Secure EPT is a subset of EPT
+because not all functionalities are available.
+
+Since the execution of such interface functions takes much longer than
+accessing memory directly, in KVM we use the existing TDP code to mirror the
+Secure EPT for the TD. There are at least two options today in
+terms of the timing for executing such SEAMCALLs:
+
+1. synchronous, i.e. while walking the TDP page tables, or
+2. post-walk, i.e. record what needs to be done to the real Secure EPT during
+   the walk, and execute SEAMCALLs later.
+
+Option 1 seems to be more intuitive and simpler, but the Secure EPT
+concurrency rules are different from the ones of the TDP or EPT. For example,
+TDH.MEM.SEPT.RD acquires shared access to the whole Secure EPT tree of the target
+
+Secure EPT(SEPT) operations
+---------------------------
+Secure EPT is an Extended Page Table for GPA-to-HPA translation of TD private
+HPA. A Secure EPT is designed to be encrypted with the TD's ephemeral private
+key. SEPT pages are allocated by the host VMM via Intel TDX functions, but their
+content is intended to be hidden and is not architectural.
+
+Unlike the conventional EPT, the host VMM can't directly read or write its
+entries. Instead, the TDX SEAMCALL API is used. Several SEAMCALLs correspond to
+operations on EPT entries.
+
+* TDH.MEM.SEPT.ADD():
+
+  Add a secure EPT page to the secure EPT tree.
+  This corresponds to updating
+  the non-leaf EPT entry with the present bit set.
+
+* TDH.MEM.SEPT.REMOVE():
+
+  Remove a secure EPT page from the secure EPT tree. There is no corresponding
+  EPT operation.
+
+* TDH.MEM.SEPT.RD():
+
+  Read the secure EPT entry. This corresponds to reading the EPT entry as
+  memory. Please note that this is much slower than direct memory reading.
+
+* TDH.MEM.PAGE.ADD() and TDH.MEM.PAGE.AUG():
+
+  Add a private page to the secure EPT tree. This corresponds to updating the
+  leaf EPT entry with the present bit set.
+
+* TDH.MEM.PAGE.REMOVE():
+
+  Remove a private page from the secure EPT tree. There is no corresponding
+  EPT operation.
+
+* TDH.MEM.RANGE.BLOCK():
+
+  This (mostly) corresponds to clearing the present bit of the leaf EPT entry.
+  Note that the private page is still linked in the secure EPT. To remove it
+  from the secure EPT, TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE() need to
+  be called.
+
+* TDH.MEM.TRACK():
+
+  Increment the TLB epoch counter. This (mostly) corresponds to EPT TLB flush.
+  Note that the private page is still linked in the secure EPT. To remove it
+  from the secure EPT, TDH.MEM.PAGE.REMOVE() needs to be called.
+
+
+Adding private page
+-------------------
+The procedure of populating the private page looks as follows.
+
+1. TDH.MEM.SEPT.ADD(512G level)
+2. TDH.MEM.SEPT.ADD(1G level)
+3. TDH.MEM.SEPT.ADD(2M level)
+4. TDH.MEM.PAGE.AUG(4K level)
+
+Those operations correspond to updating the EPT entries.
+
+Dropping private page and TLB shootdown
+---------------------------------------
+The procedure of dropping the private page looks as follows.
+
+1. TDH.MEM.RANGE.BLOCK(4K level)
+
+   This mostly corresponds to clearing the present bit in the EPT entry. This
+   prevents (or blocks) new TLB entries from being created in the future. Note
+   that the private page is still linked in the secure EPT tree and the existing
+   cache entry in the TLB isn't flushed.
+
+2. TDH.MEM.TRACK(range) and TLB shootdown
+
+   This mostly corresponds to the EPT TLB shootdown. Because all vcpus share
+   the same Secure EPT, all vcpus need to flush TLB.
+
+   * TDH.MEM.TRACK(range) by one vcpu. It increments the global internal TLB
+     epoch counter.
+
+   * send IPI to remote vcpus
+   * Other vcpus exit to the VMM from the guest TD and then re-enter with
+     TDH.VP.ENTER().
+   * TDH.VP.ENTER() checks the TLB epoch counter and, if its TLB is stale,
+     flushes the TLB.
+
+   Note that only a single vcpu issues TDH.MEM.TRACK().
+
+   Note that the private page is still linked in the secure EPT tree, unlike the
+   conventional EPT.
+
+3. TDH.MEM.PAGE.PROMOTE(), TDH.MEM.PAGE.DEMOTE(), TDH.MEM.PAGE.RELOCATE(), or
+   TDH.MEM.PAGE.REMOVE()
+
+   There is no corresponding operation to the conventional EPT.
+
+   * When changing page size (e.g. 4K <-> 2M) TDH.MEM.PAGE.PROMOTE() or
+     TDH.MEM.PAGE.DEMOTE() is used. During those operations, the guest page is
+     kept referenced in the Secure EPT.
+
+   * When migrating a page, TDH.MEM.PAGE.RELOCATE() is used. This requires both
+     the source page and the destination page.
+   * When destroying a TD, TDH.MEM.PAGE.REMOVE() removes the private page from
+     the secure EPT tree. In this case TLB shootdown is not needed because vcpus
+     don't run any more.
+
+The basic idea for TDX support
+==============================
+Because shared EPT is the same as the existing EPT, use the existing logic for
+shared EPT. On the other hand, secure EPT requires additional operations
+instead of directly reading/writing of the EPT entry.
+
+On EPT violation, the KVM MMU walks down the EPT tree from the root, determines
+the EPT entry to operate on, and updates the entry. If necessary, a TLB
+shootdown is done. Because it's very slow to directly walk secure EPT by TDX
+SEAMCALL, TDH.MEM.SEPT.RD(), the mirror of secure EPT is created and maintained.
+Add hooks to KVM MMU to reuse the existing code.
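The mirroring idea in the paragraph above can be sketched in a few lines of C. This is only a toy model under stated assumptions: every `toy_*` name is hypothetical and is not a KVM or TDX module interface, the "Secure EPT" is a plain array, and a counter stands in for the cost of a SEAMCALL. It shows that walks read the software mirror only, while every update goes through a hook that propagates the change.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_ENTRIES 8

/* Hypothetical stand-in for the real Secure EPT: reachable only via slow calls. */
struct toy_secure_ept {
	uint64_t entry[TOY_NR_ENTRIES];
	unsigned long nr_seamcalls;	/* number of slow calls issued so far */
};

/* Software-only mirror that the (reused) MMU code walks and updates. */
struct toy_mirror_ept {
	uint64_t entry[TOY_NR_ENTRIES];
	struct toy_secure_ept *sept;
};

/* Models a write-type SEAMCALL such as TDH.MEM.PAGE.AUG(). */
static void toy_seamcall_write(struct toy_secure_ept *sept, unsigned int idx,
			       uint64_t val)
{
	sept->nr_seamcalls++;
	sept->entry[idx] = val;
}

/* Models TDH.MEM.SEPT.RD(): correct but slow, so the walk avoids it. */
static uint64_t toy_seamcall_read(struct toy_secure_ept *sept, unsigned int idx)
{
	sept->nr_seamcalls++;
	return sept->entry[idx];
}

/* A walk reads the mirror only, so it issues no SEAMCALL. */
static uint64_t toy_mirror_read(const struct toy_mirror_ept *m, unsigned int idx)
{
	return m->entry[idx];
}

/* An update changes the mirror first, then the hook propagates the change. */
static void toy_mirror_write(struct toy_mirror_ept *m, unsigned int idx,
			     uint64_t val)
{
	m->entry[idx] = val;
	toy_seamcall_write(m->sept, idx, val);	/* the "hook" */
}
```

In this model one update costs exactly one slow call, and any number of later reads cost none, which is the motivation the text gives for keeping a mirror instead of walking with TDH.MEM.SEPT.RD().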
+
+EPT violation on shared GPA
+---------------------------
+(1) EPT violation on shared GPA or zapping shared GPA
+    ::
+
+        walk down shared EPT tree (the existing code)
+                |
+                |
+                V
+        shared EPT tree (CPU refers.)
+
+(2) update the EPT entry. (the existing code)
+
+    TLB shootdown in the case of zapping.
+
+
+EPT violation on private GPA
+----------------------------
+(1) EPT violation on private GPA or zapping private GPA
+    ::
+
+        walk down the mirror of secure EPT tree (mostly same as the existing code)
+                |
+                |
+                V
+        mirror of secure EPT tree (KVM MMU software only. reuse of the existing code)
+
+(2) update the (mirrored) EPT entry. (mostly same as the existing code)
+
+(3) call the hooks with the EPT entry that changed
+    ::
+
+                |
+        NEW: hooks in KVM MMU
+                |
+                V
+        secure EPT root(CPU refers)
+
+(4) the TDX backend calls necessary TDX SEAMCALLs to update real secure EPT.
+
+The major modification is to add hooks for the TDX backend for additional
+operations, to pass down which EPT (shared or private) is used, and to
+adjust the behavior if we're operating on private EPT.
+
+The following depicts the relationship.
+::
+
+                         KVM                             |      TDX module
+                          |                              |           |
+             -------------+----------                    |           |
+             |                      |                    |           |
+             V                      V                    |           |
+        shared GPA             private GPA               |           V
+  CPU shared EPT pointer KVM private EPT pointer         | CPU secure EPT pointer
+             |                      |                    |           |
+             |                      |                    |           |
+             V                      V                    |           V
+        shared EPT             private EPT<-------mirror----->Secure EPT
+             |                      |                    |           |
+             |                      \--------------------+------\    |
+             |                                           |      |    |
+             V                                           |      V    V
+     shared guest page                                   | private guest page
+                                                         |
+                                                         |
+   non-encrypted memory                                  |  encrypted memory
+                                                         |
+
+shared EPT: CPU and KVM walk with shared GPA
+            Maintained by the existing code
+private EPT: KVM walks with private GPA
+             Maintained by the modified existing code
+secure EPT: CPU walks with private GPA.
+            Maintained by TDX module with TDX SEAMCALLs via hooks
+
+
+Tracking private EPT page
+=========================
+Shared EPT pages are managed by struct kvm_mmu_page. They are linked in a list
+structure.
+When necessary, the list is traversed to operate on them. Private EPT
+pages have different characteristics. For example, private pages can't be
+swapped out. When shrinking memory, we'd like to traverse only shared EPT pages
+and skip private EPT pages. Likewise, page migration isn't supported for
+private pages (yet). Introduce an additional list so that shared EPT pages and
+private EPT pages are tracked independently.
+
+At the beginning of EPT violation, the fault handler knows the faulting GPA,
+thus it knows which EPT to operate on, private or shared. If it's private EPT,
+an additional task is done. Something like "if (private) { callback a hook }".
+Since the fault handler has deep function calls, it's cumbersome to hold the
+information of which EPT is being operated on. Options to mitigate it are
+
+1. Pass the information as an argument for the function call.
+2. Record the information in struct kvm_mmu_page somehow.
+3. Record the information in vcpu structure.
+
+Option 2 was chosen because option 1 requires modifying all the functions, which
+would adversely affect the normal case. Option 3 doesn't work well because in
+some cases, we need to walk both private and shared EPT.
+
+The role of the EPT page can be utilized and one bit can be carved out from
+unused bits in union kvm_mmu_page_role. When allocating the EPT page,
+initialize the information. Mostly struct kvm_mmu_page is available because
+we're operating on EPT pages.
+
+
+The conversion of private GPA and shared GPA
+============================================
+A page of a given GPA can be assigned to only private GPA xor shared GPA at one
+time. (This is a restriction of the KVM implementation to avoid doubling guest
+memory usage, not of the TDX architecture.) The GPA can't be accessed
+simultaneously via both private GPA and shared GPA. On guest startup, all the
+GPAs are assigned as private. The guest converts a range of GPAs from private
+to shared (or from shared to private) with the MapGPA hypercall.
+The MapGPA hypercall takes
+the start GPA and the size of the region. If the given start GPA is shared
+(shared bit set), VMM converts the region into shared (if it's already shared,
+nop).
+
+If the guest TD triggers an EPT violation on the already converted region,
+i.e. EPT violation on private (or shared) GPA when the page is shared (or
+private), the access won't be allowed. KVM_EXIT_MEMORY_FAULT is triggered. The
+user space VMM will decide how to handle it.
+
+If the guest accesses private (or shared) GPA after the conversion to shared (or
+private), the following sequence will be observed:
+
+1. MapGPA(shared GPA: shared bit set) hypercall
+2. KVM causes KVM_TDX_EXIT with hypercall to the user space VMM.
+3. The user space VMM converts the GPA with KVM_SET_MEMORY_ATTRIBUTES(shared).
+4. The user space VMM resumes vcpu execution with KVM_RUN.
+5. Guest TD accesses private GPA (shared bit cleared)
+6. KVM gets EPT violation on private GPA (shared bit cleared)
+7. KVM finds the GPA was set to be shared in the xarray while the faulting GPA
+   is private (shared bit cleared)
+8. KVM_EXIT_MEMORY_FAULT. User space VMM, e.g. qemu, decides what to do.
+   Typically it requests KVM to convert the GPA without a MapGPA hypercall.
+9. KVM converts GPA from shared to private with
+   KVM_SET_MEMORY_ATTRIBUTES(private)
+10. Resume vcpu execution
+
+Instead of the conversion at step 9, the user space VMM may conclude that such
+a memory access is due to a race and let the vcpu resume without conversion,
+expecting that another vcpu issues MapGPA. Or the user space VMM may consider
+such a memory access suspicious, i.e. the guest is trying to attack the VMM. It
+may throttle vcpu execution as mitigation or finally kill such a guest. Or the
+user space VMM may consider it a bug of the guest TD and kill the guest TD.
+
+This sequence is not efficient. Guest TD shouldn't access private (or shared)
+GPA after converting GPA to shared (or private). Although KVM can handle it,
+it's sub-optimal and won't be optimized.
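The mismatch check in steps 5 to 8 above can be illustrated with a small C model. It is a sketch under stated assumptions, not KVM code: all `toy_*` names are hypothetical, a flat array stands in for the xarray of memory attributes, and the shared bit is fixed at GPA bit 47 purely for illustration (the real position depends on the guest physical address width).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_SHARED_BIT	47	/* assumption made for this sketch only */
#define TOY_SHARED_MASK	(1ULL << TOY_SHARED_BIT)
#define TOY_NR_GFNS	16

enum toy_fault_result { TOY_RESUME_GUEST, TOY_EXIT_MEMORY_FAULT };

/* Per-page attribute. Zero-initialized means private, as on guest startup. */
struct toy_vm {
	bool is_shared[TOY_NR_GFNS];
};

static bool toy_gpa_is_shared(uint64_t gpa)
{
	return (gpa & TOY_SHARED_MASK) != 0;
}

/* Strip the shared bit so both aliases map to the same guest frame number. */
static uint64_t toy_gpa_to_gfn(uint64_t gpa)
{
	return (gpa & ~TOY_SHARED_MASK) >> TOY_PAGE_SHIFT;
}

/* Models what user space requests via KVM_SET_MEMORY_ATTRIBUTES. */
static void toy_set_memory_attributes(struct toy_vm *vm, uint64_t gfn,
				      bool shared)
{
	vm->is_shared[gfn] = shared;
}

/* Fault on the alias that disagrees with the recorded attribute: exit. */
static enum toy_fault_result toy_handle_ept_violation(const struct toy_vm *vm,
						      uint64_t gpa)
{
	uint64_t gfn = toy_gpa_to_gfn(gpa);

	if (gfn >= TOY_NR_GFNS || vm->is_shared[gfn] != toy_gpa_is_shared(gpa))
		return TOY_EXIT_MEMORY_FAULT;
	return TOY_RESUME_GUEST;
}
```

The model makes the "private xor shared" rule explicit: at any time exactly one of the two aliases of a page resolves, and a fault on the other alias is reported to user space rather than handled in place.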
+
+The original TDP MMU and race condition
+=======================================
+Because vcpus share the EPT, once the EPT entry is zapped, we need to shoot down
+the TLB. Send IPI to remote vcpus. Remote vcpus flush their own TLBs. Until TLB
+shootdown is done, vcpus may reference the zapped guest page.
+
+TDP MMU uses read lock of mmu_lock to mitigate vcpu contention. When read lock
+is obtained, it depends on the atomic update of the EPT entry. (On the other
+hand legacy MMU uses write lock.) When vcpu is populating/zapping the EPT entry
+with a read lock held, other vcpu may be populating or zapping the same EPT
+entry at the same time.
+
+To avoid the race condition, the entry is frozen. It means the EPT entry is set
+to the special value, REMOVED_SPTE, which clears the present bit. And then after
+TLB shootdown, update the EPT entry to the final value.
+
+Concurrent zapping
+------------------
+1. read lock
+2. freeze the EPT entry (atomically set the value to REMOVED_SPTE)
+   If other vcpu froze the entry, restart page fault.
+3. TLB shootdown
+
+   * send IPI to remote vcpus
+   * TLB flush (local and remote)
+
+   For each entry update, TLB shootdown is needed because of the
+   concurrency.
+4. atomically set the EPT entry to the final value
+5. read unlock
+
+Concurrent populating
+---------------------
+In the case of populating the non-present EPT entry, atomically update the EPT
+entry.
+
+1. read lock
+
+2. atomically update the EPT entry
+   If other vcpu froze the entry or updated the entry, restart page fault.
+
+3. read unlock
+
+In the case of updating the present EPT entry (e.g. page migration), the
+operation is split into two. Zapping the entry and populating the entry.
+
+1. read lock
+2. zap the EPT entry. follow the concurrent zapping case.
+3. populate the non-present EPT entry.
+4. read unlock
+
+Non-concurrent batched zapping
+------------------------------
+In some cases, zapping the ranges is done exclusively with a write lock held.
+In this case, the TLB shootdown is batched into one.
+
+1. write lock
+2. zap the EPT entries by traversing them
+3. TLB shootdown
+4. write unlock
+
+For Secure EPT, TDX SEAMCALLs are needed in addition to updating the mirrored
+EPT entry.
+
+TDX concurrent zapping
+----------------------
+Add a hook for TDX SEAMCALLs at the step of the TLB shootdown.
+
+1. read lock
+2. freeze the EPT entry (set the value to REMOVED_SPTE)
+3. TLB shootdown via a hook
+
+   * TDH.MEM.RANGE.BLOCK()
+   * TDH.MEM.TRACK()
+   * send IPI to remote vcpus
+
+4. set the EPT entry to the final value
+5. read unlock
+
+TDX concurrent populating
+-------------------------
+TDX SEAMCALLs are required in addition to operating the mirrored EPT entry. The
+frozen entry is utilized by following the zapping case to avoid the race
+condition. A hook can be added.
+
+1. read lock
+2. freeze the EPT entry
+3. hook
+
+   * TDH.MEM.SEPT.ADD() for non-leaf or TDH.MEM.PAGE.AUG() for leaf.
+
+4. set the EPT entry to the final value
+5. read unlock
+
+Without freezing the entry, the following race can happen. Suppose two vcpus
+are faulting on the same GPA and the 2M and 4K level entries aren't populated
+yet.
+
+* vcpu 1: update 2M level EPT entry
+* vcpu 2: update 4K level EPT entry
+* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
+* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
+
+
+TDX non-concurrent batched zapping
+----------------------------------
+For simplicity, the procedure of concurrent populating is utilized. The
+procedure can be optimized later.
+
+
+Co-existing with unmapping guest private memory
+===============================================
+TODO. This needs to be addressed.
+
+
+Restrictions or future work
+===========================
+The following features aren't supported yet.
+
+* optimizing non-concurrent zap
+* Large page
+* Page migration
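The freeze, hook, commit steps from the "TDX concurrent populating" section can be sketched with C11 atomics. This is a minimal illustration, not the KVM implementation: all `toy_*` names and the `TOY_FROZEN_SPTE` value are hypothetical (the latter only stands in for REMOVED_SPTE), the read lock is omitted, and the hook merely counts calls where the real code would issue TDH.MEM.SEPT.ADD() or TDH.MEM.PAGE.AUG().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NON_PRESENT	0ULL
#define TOY_PRESENT	(1ULL << 0)
#define TOY_FROZEN_SPTE	(1ULL << 62)	/* hypothetical stand-in for REMOVED_SPTE */

struct toy_spte {
	_Atomic uint64_t val;
	unsigned long nr_seamcalls;	/* touched only by the vcpu that froze val */
};

static void toy_spte_init(struct toy_spte *e)
{
	atomic_init(&e->val, TOY_NON_PRESENT);
	e->nr_seamcalls = 0;
}

/* Step 2: atomically move a non-present entry to the frozen value. */
static bool toy_freeze(struct toy_spte *e)
{
	uint64_t expected = TOY_NON_PRESENT;

	return atomic_compare_exchange_strong(&e->val, &expected,
					      TOY_FROZEN_SPTE);
}

/* Step 3: stand-in for the hook that would issue the SEAMCALL. */
static void toy_hook(struct toy_spte *e)
{
	e->nr_seamcalls++;
}

/* Step 4: publish the final value; the entry is no longer frozen. */
static void toy_commit(struct toy_spte *e, uint64_t final_val)
{
	atomic_store(&e->val, final_val);
}

/* Returns false when the caller lost the race and must restart the fault. */
static bool toy_populate(struct toy_spte *e, uint64_t final_val)
{
	if (!toy_freeze(e))
		return false;
	toy_hook(e);
	toy_commit(e, final_val);
	return true;
}
```

Because only the vcpu that wins the compare-and-exchange runs the hook, the "SEAMCALL" for an entry is issued once and always before the entry becomes visible as present, which is the ordering the race example in the text requires.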
