[v7,000/102] KVM TDX basic feature support

Message ID: cover.1656366337.git.isaku.yamahata@intel.com
Isaku Yamahata June 27, 2022, 9:52 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

KVM TDX basic feature support

Hello.  This is v7 of the patch series of KVM TDX support.
This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM

Major changes from v6:
- rebased to v5.19 base

TODO:
- integrate fd-based guest memory. As the discussion is still ongoing, I
  intentionally left out fd-based guest memory support for now.  The integration can
  be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
- 2M large page support. It's work-in-progress.
For large page support, there are several design choices. Here are the design
options.  Any thoughts/feedback?

KVM MMU Large page support for TDX

* What needs to be done
- Track whether each page is private or shared at each page size (4KB, 2MB, 1GB),
  based on TDG.VP.VMCALL<MapGPA>.  For large pages (2MB, 1GB), the state can be
  mixed (some lower-size pages private and some shared).  In this case, the page
  can't be mapped as a large page.
- if necessary, split large page on TDG.VP.VMCALL<MapGPA>
  (split on dirty page tracking is future work)
- resolving KVM page fault
  When resolving a private page that is backed by a large page on the host, the
  GPA can be resolved as a large page in Secure-EPT.  Even if the page is large
  on the host side, sometimes a 4KB mapping is used, because it's up to the guest
  TD to accept the page at 4KB, 2MB, or 1GB granularity.
- collapsing pages into a large page.
  At this point, it's okay to not implement this.  When dirty page tracking is
  supported, this needs to be supported.
  - On MapGPA, the page can be collapsed into a large page
  - handle zapping SPTE and try to collapse the pages on the next KVM page fault
    Unlike the EPT case, some trick is needed.
- For performance, optimize KVM page fault path at the cost of complicating
  MapGPA path.

* options to track private or shared
At each page size (4KB, 2MB, and 1GB), track whether the page is private, shared,
or mixed (the mixed state applies to 2MB and 1GB only).  For each 4KB page, 1 bit
per page is needed: private or shared.  For large pages (2MB and 1GB), 2 bits per
large page are needed: private, shared, or mixed.  When resolving a KVM page
fault, for performance we don't want to scan the lower-size pages to check
whether the given GPA can be mapped as a large page.  Check it on MapGPA instead.

Option A). enhance kvm_arch_memory_slot
  enum kvm_page_type {
       KVM_PAGE_TYPE_INVALID,
       KVM_PAGE_TYPE_SHARED,
       KVM_PAGE_TYPE_PRIVATE,
       KVM_PAGE_TYPE_MIXED,
  };

  struct kvm_page_attr {
       enum kvm_page_type type;
  };

 struct kvm_arch_memory_slot {
 +      struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
 };
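
As a minimal sketch of how Option A)'s lookup could work (the toy memslot,
KVM_NR_PAGE_SIZES value, and index math below are illustrative assumptions,
not the series' actual code):

  #include <stdbool.h>
  #include <stdint.h>

  #define KVM_NR_PAGE_SIZES 3     /* 4KB, 2MB, 1GB (assumed here) */

  /* Toy memslot: page_attr[level][n] describes the n-th page at that level. */
  struct toy_memslot {
          uint64_t base_gfn;
          struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
  };

  /* Can the 2MB region covering @gfn be mapped as one large page?
   * Only if the whole range is uniformly private or uniformly shared;
   * KVM_PAGE_TYPE_MIXED forces 4KB mappings. */
  static bool can_map_2mb(struct toy_memslot *slot, uint64_t gfn)
  {
          uint64_t idx = (gfn - slot->base_gfn) / 512;  /* 512 4KB pages per 2MB */
          enum kvm_page_type type = slot->page_attr[1][idx].type;

          return type == KVM_PAGE_TYPE_PRIVATE || type == KVM_PAGE_TYPE_SHARED;
  }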

Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
If SPTE_MIXED_MASK is not set, the page can be mapped as a large page.

Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
The kvm_mmu_page::mixed bitmap at the 1GB and root levels indicates mixed for
2MB and 1GB, respectively.


* comparison
A).
+ straightforward to implement
+ SPTE_SHARED_MASK isn't needed
- higher memory overhead than B) or C)
- more memory references on KVM page fault

B).
+ simpler than C) (but more complex than A))
+ efficient on KVM page fault. (only SPTE reference)
+ low memory overhead
- wastes precious SPTE bits.

C).
+ efficient on KVM page fault. (only SPTE reference)
+ low memory overhead
- complicates MapGPA
- scattered data structure

Thanks,
Isaku Yamahata

Changes from v6:
- rebased to v5.19

Changes from v5:
- export __seamcall and use it
- move mutex lock from callee function of smp_call_on_cpu to the caller.
- rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
- updated comment
- drop the use of tdh_mng_key.reclaimid(): as the function is for backward
  compatibility to only return success
- struct kvm_tdx_cmd: metadata => flags, added __u64 error.
- make this ioctl systemwide ioctl
- ABI change to struct kvm_init_vm
- guest_tsc_khz: use kvm->arch.default_tsc_khz
- rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
- drop exporting kvm_set_tsc_khz().
- fix kvm_tdp_page_fault() for mtrr emulation
- rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
- drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
  keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
- update commit message
- rename shadow_init_value => shadow_nonpresent_value
- added ept_violation_ve_test mode
- shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
- legacy MMU case
  => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
     - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
- #VE warning:
- rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
- merged this patch into "KVM: x86/mmu: Allow non-zero init value for shadow
  PTE", as discussed.
- fix pointed out by Sagi: !is_private check => (kvm_gfn_shared_mask && !is_private)
- introduce kvm_gfn_for_root(kvm, root, gfn)
- add only_shared argument to kvm_tdp_mmu_handle_gfn()
- use kvm_arch_dirty_log_supported()
- rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
- rename: is_private_prohibit_spte() => spte_shared_mask()
- fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
- dropped this patch as the change was merged into kvm/queue
- update vt_apicv_post_state_restore()
- use is_64_bit_hypercall()
- comment: expand MSMI -> Machine Check System Management Interrupt
- fixed TDX_SEPT_PFERR
- tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
- rename tdmvcall_exit_readon() => tdvmcall_leaf()
- remove optional zero check of argument.
- do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
   in kvm_vcpu_ioctl_smi and __apic_accept_irq.
- WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
- introduce vcpu_deliver_init to x86_ops
- sprinkled KVM_BUG_ON()

Changes from v4:
- rebased to TDX host kernel patch series.
- include all the patches to make this patch series working.
- add [MARKER] patches to clearly mark each patch layer.

---
* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.

A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.

We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).

In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.


* The organization of this patch series
This patch series is on top of the patch series "TDX host kernel support":
https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/

This patch series is available at
https://github.com/intel/tdx/releases/tag/kvm-upstream
The corresponding patches to qemu are available at
https://github.com/intel/qemu-tdx/commits/tdx-upstream

The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.

The layers below are chosen so that the device model, for example qemu, can
exercise each layer step by step: check if TDX is supported, create a TD VM,
create a TD vcpu, allow vcpu running, populate TD guest private memory, and
handle vcpu exits/hypercalls/interrupts to run the TD fully.

  TDX vcpu
  interrupt/exits/hypercall<------------\
        ^                               |
        |                               |
  TD finalization                       |
        ^                               |
        |                               |
  TDX EPT violation<------------\       |
        ^                       |       |
        |                       |       |
  TD vcpu enter/exit            |       |
        ^                       |       |
        |                       |       |
  TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
        ^                       |                       ^
        |                       |                       |
  TD VM creation/destruction    \---------------KVM TDP MMU hooks
        ^                                               ^
        |                                               |
  TDX architectural definitions                 KVM TDP refactoring for TDX
        ^                                               ^
        |                                               |
   TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
   coexistence          support


The following are explanations of each layer.  Each layer has a dummy commit
that starts with [MARKER] in the subject, intended to help identify where each
layer starts.

TDX host kernel support:
        https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
        The guts of system-wide initialization of TDX module.  There is an
        independent patch series for host x86.  TDX KVM patches call functions
        this patch series provides to initialize the TDX module.

TDX, VMX coexistence:
        Infrastructure to allow TDX to coexist with VMX and trigger the
        initialization of the TDX module.
        This layer starts with
        "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
        Add TDX architectural definitions and helper functions
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
        Guest TD creation/destruction: allocation and release of the TDX
        specific vm and vcpu structures.  Create an initial guest memory image
        with TDX measurement.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
        Guest TD vcpu creation/destruction: allocation and release of the TDX
        specific vm and vcpu structures.  Create an initial guest memory image
        with TDX measurement.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
        Create an initial guest memory image with TDX measurement.  Handle
        secure EPT violations to populate guest pages with TDX SEAMCALLs.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD vcpu enter/exit:
        Allow TDX vcpu to enter into TD and exit from TD.  Save CPU state before
        entering into TD.  Restore CPU state after exiting from TD.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu interrupts/exit/hypercall:
        Handle various exits/hypercalls and allow interrupts to be injected so
        that TD vcpu can continue running.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"

KVM MMU GPA shared bit:
        Introduce a framework to handle the shared bit of the GPA.  TDX
        repurposes one bit of the GPA to indicate whether the page is shared or
        private.  If it's shared, it's the same as the conventional VMX EPT
        case, and the VMM can access shared guest pages.  If it's private, it's
        handled by Secure-EPT and the guest page is encrypted.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
        TDX Secure EPT requires different constants, e.g. the initial EPT entry
        value.  Various refactoring for those differences.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
        Introduce a framework in the TDP MMU to add hooks in addition to direct
        EPT access.  TDX adds Secure EPT, which is an enhancement to VMX EPT.
        Unlike conventional VMX EPT, the CPU can't directly read/write Secure
        EPT; instead, TDX SEAMCALLs are used to operate on it.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
        Introduce framework to handle switching guest pages from private/shared
        to shared/private.  For a given GPA, a guest page can be assigned to a
        private GPA or a shared GPA exclusively.  With TDX MapGPA hypercall,
        guest TD converts GPA assignments from private (or shared) to shared (or
        private).
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "

KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
        Guest private memory requires different memory management in KVM.  The
        patch series proposes a way for it, plus integration with TDX KVM.

(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.

The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation.  TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.

TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode.  A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.

TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.

TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and is encrypted (using the guest TD's ephemeral private key).  At a high
level, TDCS holds information for controlling TD operation as a whole:
execution controls, EPTP, MSR bitmaps, etc. that KVM needs to set up.  Note that
MSR bitmaps are held as part of TDCS (unlike VMX) because they are meant to have
the same value for all VCPUs of the same TD.

Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU.  It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.

Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM.  These control
structures are pointed to by fields in the TD VMCS.

The above means that 1) KVM needs to allocate different data structures for TDs,
2) KVM can reuse the existing code for some operations on TDs, and 3) it needs
to define TD-specific handling for others by redirecting operations to the TDX
specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback(); else
vmx_callback();".
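
For illustration, here is a self-contained toy of that dispatch pattern;
is_td_vcpu() and the tdx_/vmx_ naming follow the text above, while the stub
bodies and the flush-TLB example are placeholders, not the series' code:

  #include <stdbool.h>
  #include <stdio.h>

  struct kvm_vcpu { bool is_td; };

  static bool is_td_vcpu(struct kvm_vcpu *vcpu) { return vcpu->is_td; }

  /* Backend-specific implementations (stubs here). */
  static void tdx_flush_tlb(struct kvm_vcpu *vcpu) { puts("flush via SEAMCALL"); }
  static void vmx_flush_tlb(struct kvm_vcpu *vcpu) { puts("flush via INVEPT"); }

  /* Wrapper installed in the x86 ops table; picks the backend per vcpu. */
  static void vt_flush_tlb(struct kvm_vcpu *vcpu)
  {
          if (is_td_vcpu(vcpu))
                  tdx_flush_tlb(vcpu);
          else
                  vmx_flush_tlb(vcpu);
  }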

* TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.

In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.

During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM - the same as
with the current VMX.  Since guest TDs usually require I/O and the data exchange
needs to be done via shared memory, KVM needs to use the current EPT
functionality even for TDs.

* Secure EPT and Mirroring using the TDP code
The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key.  Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, and thus conceptually Secure EPT is a subset of EPT.  Since
execution of such interface functions takes much longer than accessing memory
directly, in KVM we use the existing TDP code to mirror the Secure EPT for the
TD.

This way, we can effectively walk Secure EPT without using the TDX interface
functions.

* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently.  For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.

* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
by TD configuration, i.e. when the TD is created, not per VCPU.  Currently KVM
owns TSC virtualization for VMs, but the TDX module does so for TDs.

* MCE support for TDs
The TDX module doesn't allow the VMM to inject MCEs.  Instead, a PV mechanism is
needed for the TD to communicate with the VMM.  For now, KVM silently ignores
MCE injection requests from the VMM.  MSRs related to MCE (e.g., MCE bank
registers) can be naturally emulated by paravirtualizing MSR access.

For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
available.

* Restrictions or future work
Some features are not included to reduce patch size.  Those features are
addressed as future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more

* Prerequisites
It's required to load the TDX module and initialize it.  That is out of the
scope of this patch series.  Another independent patch series for the common x86
code is planned.  It defines CONFIG_INTEL_TDX_HOST, and this patch series uses
it.  It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX module is
initialized and the TDX module APIs for the TDX guest life cycle, like
tdh.mng.init, are ready for KVM to use.

Concretely, global initialization, LP (Logical Processor) initialization, global
configuration, key configuration, and TDMR and PAMT initialization are done.
The state of the TDX module is SYS_READY.  Please refer to the TDX module
specification, the chapter "Intel TDX Module Lifecycle State Machine".

** Detecting the TDX module readiness.
The TDX host patch series implements detection of TDX module availability and
its initialization so that KVM can use it.  It also manages the Host KeyIDs
(HKIDs) assigned to guest TDs.
The assumed APIs the TDX host patch series provides are
- int seamrr_enabled()
  Check if the required CPU feature (SEAM mode) is available.  This only checks
  CPU feature availability.  At this point, the TDX module may not be ready for
  KVM to use.
- int init_tdx(void);
  Initialize the TDX module so that it is ready for KVM to use.
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
  Return the system-wide information about the TDX module, or NULL if the TDX
  module isn't initialized.
- u32 tdx_get_global_keyid(void);
  Return the global key ID used by the TDX module itself.
- int tdx_keyid_alloc(void);
  Allocate HKID for guest TD.
- void tdx_keyid_free(int keyid);
  Free HKID for guest TD.
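
A sketch of how KVM module initialization might consume these assumed APIs;
the function name kvm_tdx_hardware_setup() and the error handling are
illustrative assumptions, not part of the series:

  /* Sketch only; depends on the assumed host-side APIs listed above. */
  static const struct tdsysinfo_struct *tdsysinfo;

  static int kvm_tdx_hardware_setup(void)
  {
          if (!seamrr_enabled())
                  return -ENODEV;         /* no SEAM mode on this CPU */

          if (init_tdx())
                  return -EIO;            /* TDX module failed to initialize */

          tdsysinfo = tdx_get_sysinfo();  /* system-wide TDX module info */
          if (!tdsysinfo)
                  return -EIO;

          return 0;
  }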

(****)
* TDX KVM high-level design
- Host key ID management
Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
It is assumed that the TDX host patch series implements the necessary functions:
u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
void tdx_keyid_free(int keyid).

- Data structures and VM type
Because TDX is different from VMX, define its own VM/VCPU structures, struct
kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct vcpu_vmx.  To
identify the VM, introduce a VM type to specify which VM type, VMX (default) or
TDX, is used.

- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
parameters, set initial guest memory and measurement.

The creation of a TDX VM requires five new operations on top of the conventional
VM creation:
  - Get KVM system capability to check if TDX VM type is supported
  - VM creation (KVM_CREATE_VM)
  - New: Get the TDX specific system parameters.  KVM_TDX_GET_CAPABILITY.
  - New: Set TDX specific VM parameters.  KVM_TDX_INIT_VM.
  - VCPU creation (KVM_CREATE_VCPU)
  - New: Set TDX specific VCPU parameters.  KVM_TDX_INIT_VCPU.
  - New: Initialize guest memory as boot state and extend the measurement with
    the memory.  KVM_TDX_INIT_MEM_REGION.
  - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
    TDX VM contents.
  - VCPU RUN (KVM_VCPU_RUN)
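
Below is a hypothetical userspace sketch of that sequence; the kvm_tdx_cmd
usage, the tdx_op() helper, and the exact constants are assumptions based on
this description, not the final uapi:

  /* Every TDX-specific step funnels through the re-purposed
   * KVM_MEMORY_ENCRYPT_OP ioctl on the VM or vcpu fd (sketch). */
  static int tdx_op(int fd, unsigned int id, void *data)
  {
          struct kvm_tdx_cmd cmd = {
                  .id = id,                          /* e.g. KVM_TDX_INIT_VM */
                  .data = (__u64)(unsigned long)data,
          };

          return ioctl(fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }

  /* Assumed creation order (error handling omitted):
   *   vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, tdx_vm_type);
   *   tdx_op(vm_fd, KVM_TDX_GET_CAPABILITY, &caps);
   *   tdx_op(vm_fd, KVM_TDX_INIT_VM, &init_vm);
   *   vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
   *   tdx_op(vcpu_fd, KVM_TDX_INIT_VCPU, NULL);
   *   tdx_op(vm_fd, KVM_TDX_INIT_MEM_REGION, &region);
   *   tdx_op(vm_fd, KVM_TDX_FINALIZE, NULL);
   *   ioctl(vcpu_fd, KVM_RUN, NULL);
   */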

- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on it, for example, accessing CPU registers, injecting
exceptions, or accessing guest memory.  Those operations are silently ignored,
returning zero or the initial reset value when requested via KVM API ioctls.

    VM/VCPU state and callbacks for TDX specific operations.
    Define TDX specific VM state and VCPU state instead of the VMX ones.
    Redirect operations to TDX specific callbacks: "if (tdx) tdx_op() else
    vmx_op()".

    Operations on the CPU state
    Silently ignore operations on the guest state.  For example, writes to CPU
    registers are ignored and reads from CPU registers return 0.

    . ignore access to CPU registers except for allowed ones.
    . TSC: add a check whether the TSC is immutable and return an error, because
      the KVM implementation updates internal TSC state and it's difficult to
      back out those changes.  Instead, skip the logic.
    . dirty logging: add a check whether dirty logging is supported.
    . exceptions/SMI/MCE/SIPI/INIT: silently ignore

    Note: virtual external interrupt and NMI can be injected into TDX guests.
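
As a toy model of the "silently ignore" policy above (the helper names are
hypothetical, chosen only for illustration):

  #include <stdint.h>

  /* Writes to protected guest registers are dropped... */
  static void td_write_reg(int reg, uint64_t val)
  {
          (void)reg;
          (void)val;              /* silently ignored */
  }

  /* ...and reads return 0, matching the behavior described above. */
  static uint64_t td_read_reg(int reg)
  {
          (void)reg;
          return 0;
  }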

- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
the guest physical address is private (the bit is cleared) or shared (the bit is
set).  This bit is called a stolen bit.

  - Stolen bits framework
    systematically tracks whether each guest physical address is used as shared
    or private.

  - Shared EPT and secure EPT
    There are two EPTs: Shared EPT (the conventional one) and Secure EPT
    (the new one).  Shared EPT handles GPAs with the stolen bit set, the
    same as the conventional case.  Secure EPT points to private guest
    pages.  To resolve an EPT violation, KVM walks one of the two EPTs
    based on the faulted GPA.  Because it's costly to access Secure EPT
    with SEAMCALLs while walking EPTs for private guest physical
    addresses, another private EPT is used as a mirror of Secure-EPT with
    the existing logic, at the cost of extra memory.
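
A toy sketch of picking which EPT to walk from the faulted GPA; the bit
position and helper names are assumptions consistent with the text above:

  #include <stdbool.h>
  #include <stdint.h>

  #define SHARED_BIT 51           /* or 47, depending on paging level */

  static bool gpa_is_shared(uint64_t gpa)
  {
          return gpa & (1ULL << SHARED_BIT);
  }

  /* Shared GPAs walk the conventional EPT; private GPAs walk the mirror
   * private EPT that KVM keeps in sync with Secure-EPT via SEAMCALLs. */
  static uint64_t *ept_root_for(uint64_t gpa, uint64_t *shared_ept_root,
                                uint64_t *private_ept_root)
  {
          return gpa_is_shared(gpa) ? shared_ept_root : private_ept_root;
  }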

The following depicts the relationship.

                    KVM                             |       TDX module
                     |                              |           |
        -------------+----------                    |           |
        |                      |                    |           |
        V                      V                    |           |
     shared GPA           private GPA               |           |
  CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
        |                      |                    |           |
        |                      |                    |           |
        V                      V                    |           V
  shared EPT                private EPT--------mirror----->Secure EPT
        |                      |                    |           |
        |                      \--------------------+------\    |
        |                                           |      |    |
        V                                           |      V    V
  shared guest page                                 |    private guest page
                                                    |
                                                    |
                              non-encrypted memory  |    encrypted memory
                                                    |

  - Operating on Secure EPT
    Use the TDX module APIs to operate on Secure EPT.  To call the TDX APIs
    while resolving an EPT violation, add hooks for the additional operations
    and wire them to the TDX backend.

* References

[1] TDX specification
   https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
   https://cdrdv2.intel.com/v1/dl/getContent/726790
[3] Intel CPU Architectural Extensions Specification
   https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 Specification
   https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
[5] Intel TDX Loader Interface Specification
  https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
   https://cdrdv2.intel.com/v1/dl/getContent/726790
[7] Intel TDX Virtual Firmware Design Guide
   https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
[8] intel public github
   kvm TDX branch: https://github.com/intel/tdx/tree/kvm
   TDX guest branch: https://github.com/intel/tdx/tree/guest
   qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
    https://github.com/tianocore/edk2-staging/tree/TDVF
    This was merged into EDK2 main branch. https://github.com/tianocore/edk2

Chao Gao (3):
  KVM: x86: Move check_processor_compatibility from init ops to runtime
    ops
  Partially revert "KVM: Pass kvm_init()'s opaque param to additional
    arch funcs"
  KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
    wrmsr

Isaku Yamahata (72):
  KVM: Refactor CPU compatibility check on module initialization
  x86/virt/vmx/tdx: export platform_tdx_enabled()
  KVM: TDX: Detect CPU feature on kernel module initialization
  KVM: x86: Refactor KVM VMX module init/exit functions
  KVM: TDX: Add placeholders for TDX VM/vcpu structure
  x86/virt/tdx: Add a helper function to return system wide info about
    TDX module
  KVM: TDX: Initialize TDX module when loading kvm_intel.ko
  KVM: TDX: Make TDX VM type supported
  [MARKER] The start of TDX KVM patch series: TDX architectural
    definitions
  KVM: TDX: Define TDX architectural definitions
  KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
  KVM: TDX: Add helper functions to print TDX SEAMCALL error
  [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
  x86/cpu: Add helper functions to allocate/free TDX private host key id
  KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  KVM: TDX: Make pmu_intel.c ignore guest TD case
  [MARKER] The start of TDX KVM patch series: TD vcpu
    creation/destruction
  KVM: TDX: allocate/free TDX vcpu structure
  KVM: TDX: allocate/free TDX vcpu structure
  [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
  KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
    TDX
  KVM: x86/mmu: Disallow fast page fault on private GPA
  KVM: VMX: Introduce test mode related to EPT violation VE
  [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
  KVM: x86/mmu: Forcibly use TDP MMU for TDX
  KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
  KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
  KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  [MARKER] The start of TDX KVM patch series: TDX EPT violation
  KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
  KVM: TDX: TDP MMU TDX support
  [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
  KVM: x86/mmu: steal software usable bit to record if GFN is for shared
    or not
  KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
  [MARKER] The start of TDX KVM patch series: TD finalization
  KVM: TDX: Create initial guest memory
  KVM: TDX: Finalize VM initialization
  [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
  KVM: TDX: Add helper assembly function to TDX vcpu
  KVM: TDX: Implement TDX vcpu enter/exit path
  KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  KVM: TDX: restore host xsave state when exit from the guest TD
  KVM: TDX: restore user ret MSRs
  [MARKER] The start of TDX KVM patch series: TD vcpu
    exits/interrupts/hypercalls
  KVM: TDX: complete interrupts after tdexit
  KVM: TDX: restore debug store when TD exit
  KVM: TDX: handle vcpu migration over logical processor
  KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
    behavior
  KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
  KVM: TDX: Implement interrupt injection
  KVM: TDX: Implements vcpu request_immediate_exit
  KVM: TDX: Implement methods to inject NMI
  KVM: TDX: Add a place holder to handle TDX VM exit
  KVM: TDX: handle EXIT_REASON_OTHER_SMI
  KVM: TDX: handle ept violation/misconfig exit
  KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
  KVM: TDX: Add a place holder for handler of TDX hypercalls
    (TDG.VP.VMCALL)
  KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
  KVM: TDX: Handle TDX PV CPUID hypercall
  KVM: TDX: Handle TDX PV HLT hypercall
  KVM: TDX: Handle TDX PV port io hypercall
  KVM: TDX: Implement callbacks for MSR operations for TDX
  KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
  KVM: TDX: Handle TDX PV report fatal error hypercall
  KVM: TDX: Handle TDX PV map_gpa hypercall
  KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
  KVM: TDX: Silently discard SMI request
  KVM: TDX: Silently ignore INIT/SIPI
  Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
  KVM: x86: design documentation on TDX support of x86 KVM TDP MMU

Rick Edgecombe (1):
  KVM: x86/mmu: Add address conversion functions for TDX shared bits

Sean Christopherson (25):
  KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
  KVM: Enable hardware before doing arch VM initialization
  KVM: x86: Introduce vm_type to differentiate default VMs from
    confidential VMs
  KVM: TDX: Add TDX "architectural" error codes
  KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
  KVM: TDX: create/destroy VM structure
  KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
  KVM: TDX: Do TDX specific vcpu initialization
  KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
  KVM: x86/mmu: Allow non-zero value for non-present SPTE
  KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  KVM: x86/mmu: Allow per-VM override of the TDP max page level
  KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
    private mmu
  KVM: x86/mmu: Disallow dirty logging for x86 TDX
  KVM: VMX: Split out guts of EPT violation to common/exposed function
  KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  KVM: TDX: Add load_mmu_pgd method for TDX
  KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
  KVM: TDX: Add support for find pending IRQ in a protected local APIC
  KVM: x86: Assume timer IRQ was injected if APIC state is protected
  KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
    argument
  KVM: VMX: Move NMI/exception handler to common helper
  KVM: x86: Split core of hypercall emulation to helper function
  KVM: TDX: Handle TDX PV MMIO hypercall
  KVM: TDX: Add methods to ignore accesses to CPU state

Xiaoyao Li (1):
  KVM: TDX: initialize VM with TDX specific parameters

 Documentation/virt/kvm/api.rst                |   30 +-
 .../virt/kvm/intel-tdx-layer-status.rst       |   33 +
 Documentation/virt/kvm/intel-tdx.rst          |  381 +++
 Documentation/virt/kvm/tdx-tdp-mmu.rst        |  466 ++++
 arch/arm64/kvm/arm.c                          |    2 +-
 arch/mips/kvm/mips.c                          |   14 +-
 arch/powerpc/kvm/powerpc.c                    |    2 +-
 arch/riscv/kvm/main.c                         |    2 +-
 arch/s390/kvm/kvm-s390.c                      |    2 +-
 arch/x86/events/intel/ds.c                    |    1 +
 arch/x86/include/asm/kvm-x86-ops.h            |   10 +
 arch/x86/include/asm/kvm_host.h               |   56 +-
 arch/x86/include/asm/tdx.h                    |   67 +
 arch/x86/include/asm/vmx.h                    |   14 +
 arch/x86/include/uapi/asm/kvm.h               |   95 +
 arch/x86/include/uapi/asm/vmx.h               |    5 +-
 arch/x86/kvm/Kconfig                          |    4 +
 arch/x86/kvm/Makefile                         |    3 +-
 arch/x86/kvm/irq.c                            |    3 +
 arch/x86/kvm/lapic.c                          |   37 +-
 arch/x86/kvm/lapic.h                          |    2 +
 arch/x86/kvm/mmu.h                            |   42 +-
 arch/x86/kvm/mmu/mmu.c                        |  360 ++-
 arch/x86/kvm/mmu/mmu_internal.h               |  123 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |    5 +-
 arch/x86/kvm/mmu/spte.c                       |   46 +-
 arch/x86/kvm/mmu/spte.h                       |   65 +-
 arch/x86/kvm/mmu/tdp_iter.c                   |    1 +
 arch/x86/kvm/mmu/tdp_iter.h                   |    5 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  690 ++++-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   12 +-
 arch/x86/kvm/svm/svm.c                        |   13 +-
 arch/x86/kvm/vmx/common.h                     |  174 ++
 arch/x86/kvm/vmx/evmcs.c                      |    2 +-
 arch/x86/kvm/vmx/evmcs.h                      |    2 +-
 arch/x86/kvm/vmx/main.c                       | 1071 +++++++
 arch/x86/kvm/vmx/pmu_intel.c                  |   39 +-
 arch/x86/kvm/vmx/pmu_intel.h                  |   28 +
 arch/x86/kvm/vmx/posted_intr.c                |   43 +-
 arch/x86/kvm/vmx/posted_intr.h                |   13 +
 arch/x86/kvm/vmx/tdx.c                        | 2465 +++++++++++++++++
 arch/x86/kvm/vmx/tdx.h                        |  275 ++
 arch/x86/kvm/vmx/tdx_arch.h                   |  157 ++
 arch/x86/kvm/vmx/tdx_errno.h                  |   29 +
 arch/x86/kvm/vmx/tdx_error.c                  |   22 +
 arch/x86/kvm/vmx/tdx_ops.h                    |  188 ++
 arch/x86/kvm/vmx/vmenter.S                    |  146 +
 arch/x86/kvm/vmx/vmx.c                        |  737 ++---
 arch/x86/kvm/vmx/vmx.h                        |   39 +-
 arch/x86/kvm/vmx/x86_ops.h                    |  235 ++
 arch/x86/kvm/x86.c                            |  148 +-
 arch/x86/virt/vmx/tdx/seamcall.S              |    2 +
 arch/x86/virt/vmx/tdx/tdx.c                   |   54 +-
 arch/x86/virt/vmx/tdx/tdx.h                   |   52 -
 include/linux/kvm_host.h                      |    4 +-
 include/uapi/linux/kvm.h                      |    2 +
 tools/arch/x86/include/uapi/asm/kvm.h         |   95 +
 tools/include/uapi/linux/kvm.h                |    1 +
 virt/kvm/kvm_main.c                           |   67 +-
 59 files changed, 7877 insertions(+), 804 deletions(-)
 create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
 create mode 100644 Documentation/virt/kvm/intel-tdx.rst
 create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
 create mode 100644 arch/x86/kvm/vmx/common.h
 create mode 100644 arch/x86/kvm/vmx/main.c
 create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
 create mode 100644 arch/x86/kvm/vmx/tdx.c
 create mode 100644 arch/x86/kvm/vmx/tdx.h
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
 create mode 100644 arch/x86/kvm/vmx/tdx_error.c
 create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
 create mode 100644 arch/x86/kvm/vmx/x86_ops.h

Isaku Yamahata July 11, 2022, 3:17 p.m. UTC | #1
Hi. Because my description of large page support was terse, I wrote up a more
detailed one.  Any feedback/thoughts on large page support?

TDP MMU large page support design

Two main discussion points
* how to track page status: private vs shared, no-largepage vs can-be-largepage
* how to trigger merging mappings from 4KB/2MB to 2MB/1GB

Expected private-vs-shared page usage
-------------------------------------
On TD boot, all pages are private, and the TD converts pages to shared as necessary.
* Most of the guest pages remain private.
* Only limited pages are converted at kernel boot
  ** bounce buffer for IO (virt-io).  It's allocated as swiotlb.  Its size is
     64MB or 6% of total guest memory.
  ** KVM PV shared page. (the current guest TD doesn't use KVM PV shared page.)
* Only a small number of pages are dynamically converted from private to shared
  and vice versa.  This usage is very limited, e.g. GetQuote or running out of
  swiotlb buffers.


Theory of Secure-EPT operations related to large page
-----------------------------------------------------
TDX Secure-EPT has differences from VMX EPT, in particular in how a page is
added to Secure-EPT.

* Here is the operation to resolve the EPT violation.
1. TD: accepts the GPA.  The TD needs to accept a GPA before accessing it
   because the TD needs to detect when the VMM unmaps and re-maps the GPA.
2. EPT violation is triggered.  TD exits to the VMM.
3. VMM: allocate a page for GPA and TDH.MEM.PAGE.AUG it to GPA.  Resume TD vcpu.
   (3a. TD: #VE<EPT violation> is injected.  #VE handler accepts the page)
4. TD: resume #VE and continue TD vcpu execution

The TD may choose to skip step 1.  In that case, after step 3, a #VE is injected
into the TD and the TD's #VE handler needs to accept the page.

When adding a page to Secure-EPT, the page contents are cleared and the page is
encrypted.  If a page is disassociated from Secure-EPT and added again, the page
content is lost.

* TDG.VP.VMCALL<MapGPA> hypercall
The page associated with GPA can be private or shared.  TD converts the GPA by
TDG.VP.VMCALL<MapGPA> hypercall from private to shared or vice versa.  VMM
tracks whether the given GPA is private or shared.

* mapping merge (promote) / split (demote)
The page can be mapped as a large page (2MB or 1GB) in addition to 4KB.  The
mapping can be merged (4KB/2MB -> 2MB/1GB) or split (2MB/1GB -> 4KB/2MB) by the
TDX SEAMCALLs TDH.MEM.PAGE.PROMOTE and TDH.MEM.PAGE.DEMOTE.
Merging a mapping requires that all the constituent pages are already mapped,
unlike VMX EPT, because of encryption.  This implies the current KVM
implementation doesn't work for TDX when merging mappings, as follows:

- EPT violation and host page is 2MB mappable.
  some of the 4KB pages of the given 2MB page are already mapped, some not.
  i.e. 2MB EPT -> 4KB EPT -> 4K pages
- The KVM page fault handler zaps the 2MB EPT entry and populates a new 2MB entry:
  zap: 2MB EPT entry => non-present
  populate: 2MB EPT entry => 2MB page

If the VMM zaps a 2MB Secure-EPT entry, the page contents will be lost for TDX.
A mapping merge requires that all pages are already mapped.

Instead, the following steps are needed.
- EPT violation and host page is 2MB mappable.
  some of the 4KB pages of the given 2MB page are already mapped.  Some not.
  i.e. 2MB EPT -> 4KB EPT -> 4K pages
- VMM checks all 4KB GPAs are private. If not, it can't be mapped as a large page.
  (****)
- VMM checks all 4KB GPAs are already mapped.  If not, give up mapping merge.
  (or map missing 4KB pages.)
- mapping merge by TDH.MEM.PAGE.PROMOTE

The mapping split for TDX Secure-EPT works similarly to the VMX EPT case.


EPT violation and MapGPA
------------------------
- EPT violation is a fast path.
- MapGPA is not a fast path.
=> Keep the EPT violation path optimized and complicate the MapGPA path instead.
For the (****) check, we don't want to scan the 4KB mappings on EPT violation.
Instead, the MapGPA path scans them and records whether the page can be mapped
as 2MB with respect to private/shared.


Tracking private/shared and large page mappable
-----------------------------------------------
The VMM needs to track whether a page is mapped private or shared at 4KB
granularity.  For efficiency of the EPT violation path (****), at the 2MB and
1GB levels the VMM should track whether the page can be mapped as a large page
(with respect to private/shared).  The VMM updates this on MapGPA and references
it on the EPT violation path. (****)

For 4KB pages, 1 bit is needed: private or shared.  Let's call it the
shared-mask bit.  For 2MB/1GB pages, 2 bits are needed: large-page mappable or
not, and private or shared if mappable.  Let's call the second one the
no-largepage bit.

Option A.)
  Allocate array for pages in struct kvm_arch_memory_slot on TD creation.
  struct kvm_arch_memory_slot {
    +struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
  };

  pros:
  +straightforward implementation
  +SPTE_SHARED_MASK is not needed
  cons:
  -memory overhead is high
  -not optimized for expected usage
  -one more look-up on EPT violation

Option B.) Steal two software usable bits from SPTE and record them in SPTE.
           SPTE_SHARED_MASK, SPTE_NOLARGE_PAGE_MASK
  pros:
  +optimized for EPT violation
  cons:
  -2 bits used in the SPTE entry
  -complicates the MapGPA path.

Option C.) Steal one software usable bit from SPTE and record it in SPTE.
           SPTE_SHARED_MASK
           For 2MB/1GB, allocate bitmap in kvm_mmu_page.
           struct kvm_mmu_page {
             bitmap nolarge
           }
  pros:
  +optimized for EPT violation
  cons:
  -complicates the MapGPA path.
  -information is scattered in SPTE and struct kvm_mmu_page
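
To make Option B.) concrete, here is a toy encoding of the two stolen SPTE
bits; the bit positions below are arbitrary assumptions (real SPTE software
bits are tightly constrained):

  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed positions of two software-usable SPTE bits (illustrative). */
  #define SPTE_SHARED_MASK        (1ULL << 62)
  #define SPTE_NOLARGE_PAGE_MASK  (1ULL << 58)

  static bool spte_is_shared(uint64_t spte)
  {
          return spte & SPTE_SHARED_MASK;
  }

  /* At the 2MB/1GB levels: may this entry be mapped as a large page? */
  static bool spte_large_page_allowed(uint64_t spte)
  {
          return !(spte & SPTE_NOLARGE_PAGE_MASK);
  }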


How to update those bits
------------------------
- MapGPA
  - at the 4KB level, set or clear the shared-mask bit.
  - at the 2MB level, scan the 512 4KB-level bits, and
    - set or clear the shared-mask bit and clear the no-largepage bit, or
    - clear the shared-mask bit and set the no-largepage bit
    - increment/decrement lpageinfo to prevent/allow large pages
  - similar for the 1GB level
  Note: This logic might be a bit tricky.

- EPT violation
  - If a 2MB large page is allowed, check the no-largepage bit:
    - If the no-largepage bit is set => go down to the 4KB page
    - If the no-largepage bit is cleared => try to map a 2MB page
      - If the 4KB level is not mapped, map a 2MB page
      - If some 4KB level is already mapped, go down to 4KB.
        Don't try to merge the mapping.  (Or it's possible to try to merge it.)
  Note: Scanning the 512 4KB entries is not done on EPT violation because it's
        the fast path.
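
The EPT-violation side of this logic as a toy sketch, reusing
SPTE_NOLARGE_PAGE_MASK from the Option B.) sketch above (control flow only,
no real mapping):

  enum map_level { MAP_4K, MAP_2M };

  /* Decide the mapping level on EPT violation without scanning 512 entries:
   * the no-largepage bit summarizes the MapGPA-time scan. */
  static enum map_level pick_level(uint64_t spte_2m, bool any_4k_mapped)
  {
          if (spte_2m & SPTE_NOLARGE_PAGE_MASK)
                  return MAP_4K;          /* mixed private/shared: no 2MB */

          if (any_4k_mapped)
                  return MAP_4K;          /* don't merge on the fast path */

          return MAP_2M;                  /* uniform and not yet mapped */
  }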


Map merging
-----------
Map merging is necessary for TD migration. (Map split is the easy part.)  The
current KVM implementation zaps the range (mmu notification or lpage recovery
worker) and expects large page mapping on the next EPT violation.

Option A.) Keep the code similar to the current zap-and-refault logic.
Zap the 2MB EPT entry in some sense and trigger the map merging logic on the
next EPT violation.  To keep the encrypted page contents, zapped EPT entries
need to keep referencing the page.  Steal one more bit from the SPTE:
SPTE_PRIVATE_BLOCKED_MASK.  It means that the page is zapped from the SPTE, but
it is still alive and the SPTE references the page.

Option B.) In the callback, directly merge mapping somehow.  In this case, mmu
notifier usage doesn't make sense.

NOTE:
- Implement map merging in MapGPA. This doesn't work for dirty page logging.
- We can utilize kvm_nx_lpage_recovery_worker
- We can utilize THP. Probably doesn't work well for fd-based private memory.

Thanks,
Isaku Yamahata

On Mon, Jun 27, 2022 at 02:52:52PM -0700,
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> KVM TDX basic feature support
> 
> Hello.  This is v7 the patch series vof KVM TDX support.
> This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> How to run/test: It's describe at https://github.com/intel/tdx/wiki/TDX-KVM
> 
> Major changes from v6:
> - rebased to v5.19 base
> 
> TODO:
> - integrate fd-based guest memory. As the discussion is still on-going, I
>   intentionally dropped fd-based guest memory support yet.  The integration can
>   be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> - 2M large page support. It's work-in-progress.
> For large page support, there are several design choices. Here is the design options.
> Any thoughts/feedback?
> 
> KVM MMU Large page support for TDX
> 
> * What needs to be done
> - Track private or shared of each page size (4KB, 2MB, 1GB) based on
>   TDG.VP.VMCALL<MapGPA>.  For large pages(2MB, 1GB), it can be mixed (some
>   lower-size pages are private and some shared.)  In this case, the page can't
>   be large.
> - if necessary, split large page on TDG.VP.VMCALL<MapGPA>
>   (split on dirty page tracking is future work)
> - resolving KVM page fault
>   When resolving a private page and the page is large in the host, GPA can be
>   resolved as a large page in Secure-EPT.  Even if the page is large on the host
>   side, sometimes a 4KB page can be resolved because it's up to guest TD to
>   accept at 4KB, 2MB, or 1GB.
> - collapsing pages into a large page.
>   At this point, it's okay to not implement this.  When dirty page tracking is
>   supported, this needs to be supported.
>   - On MapGPA, the page can be collapsed into a large page
>   - handle zapping SPTE and try to collapse the pages on the next KVM page fault
>     Unlike the EPT case, some trick is needed.
> - For performance, optimize KVM page fault path at the cost of complicating
>   MapGPA path.
> 
> * options to track private or shared
> At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> 1GB case). For 4KB each page, 1 bit per page is needed. private or shared.  For
> large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
> mixed).  When resolving KVM page fault, we don't want to check the lower-size
> pages to check if the given GPA can be a large for performance.  On MapGPA check
> it instead.
> 
> Option A). enhance kvm_arch_memory_slot
>   enum kvm_page_type {
>        KVM_PAGE_TYPE_INVALID,
>        KVM_PAGE_TYPE_SHARED,
>        KVM_PAGE_TYPE_PRIVATE,
>        KVM_PAGE_TYPE_MIXED,
>   };
> 
>   struct kvm_page_attr {
>        enum kvm_page_type type;
>   };
> 
>  struct kvm_arch_memory_slot {
>  +      struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> 
> Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> If !SPTE_MIXED_MASK, it can be large page.
> 
> Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
> 
> 
> * comparison
> A).
> + straightforward to implement
> + SPTE_SHARED_MASK isn't needed
> - memory overhead compared to B). or C).
> - more memory reference on KVM page fault
> 
> B).
> + simpler than C) (complex than A)?)
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - Waste precious SPTE bits.
> 
> C).
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - complicates MapGPA
> - scattered data structure
> 
> Thanks,
> Isaku Yamahata
> 
> Changes from v6:
> - rebased to v5.19
> 
> Changes from v5:
> - export __seamcall and use it
> - move mutex lock from callee function of smp_call_on_cpu to the caller.
> - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> - updated comment
> - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
>   compatibility to only return success
> - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> - make this ioctl systemwide ioctl
> - ABI change to struct kvm_init_vm
> - guest_tsc_khz: use kvm->arch.default_tsc_khz
> - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> - drop exporting kvm_set_tsc_khz().
> - fix kvm_tdp_page_fault() for mtrr emulation
> - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
>   keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> - update commit message
> - rename shadow_init_value => shadow_nonprsent_value
> - added ept_violation_ve_test mode
> - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> - legacy MMU case
>   => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
>      - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> - #VE warning:
> - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> - merge into Like we discussed, this patch should be merged with patch
>   "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> - fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
> - introduce kvm_gfn_for_root(kvm, root, gfn)
> - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> - use kvm_arch_dirty_log_supported()
> - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> - rename: is_private_prohibit_spte() => spte_shared_mask()
> - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> - dropped this patch as the change was merged into kvm/queue
> - update vt_apicv_post_state_restore()
> - use is_64_bit_hypercall()
> - comment: expand MSMI -> Machine Check System Management Interrupt
> - fixed TDX_SEPT_PFERR
> - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> - rename tdmvcall_exit_readon() => tdvmcall_leaf()
> - remove optional zero check of argument.
> - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
>    in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> - introduce vcpu_deliver_init to x86_ops
> - sprinkeled KVM_BUG_ON()
> 
> Changes from v4:
> - rebased to TDX host kernel patch series.
> - include all the patches to make this patch series working.
> - add [MARKER] patches to mark the patch layer clear.
> 
> ---
> * What's TDX?
> TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
> Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> Domain (TD) for confidential computing.
> 
> A TD runs in a CPU mode that is designed to protect the confidentiality of its
> memory contents and its CPU state from any other software, including the hosting
> Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
> 
> We have more detailed explanations below (***).
> We have the high-level design of TDX KVM below (****).
> 
> In this patch series, we use "TD" or "guest TD" to differentiate it from the
> current "VM" (Virtual Machine), which is supported by KVM today.
> 
> 
> * The organization of this patch series
> This patch series is on top of the patches series "TDX host kernel support":
> https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> 
> this patch series is available at
> https://github.com/intel/tdx/releases/tag/kvm-upstream
> The corresponding patches to qemu are available at
> https://github.com/intel/qemu-tdx/commits/tdx-upstream
> 
> The relations of the layers are depicted as follows.
> The arrows below show the order of patch reviews we would like to have.
> 
> The below layers are chosen so that the device model, for example, qemu can
> exercise each layering step by step.  Check if TDX is supported, create TD VM,
> create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> vcpu exits/hypercalls/interrupts to run TD fully.
> 
>   TDX vcpu
>   interrupt/exits/hypercall<------------\
>         ^                               |
>         |                               |
>   TD finalization                       |
>         ^                               |
>         |                               |
>   TDX EPT violation<------------\       |
>         ^                       |       |
>         |                       |       |
>   TD vcpu enter/exit            |       |
>         ^                       |       |
>         |                       |       |
>   TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
>         ^                       |                       ^
>         |                       |                       |
>   TD VM creation/destruction    \---------------KVM TDP MMU hooks
>         ^                                               ^
>         |                                               |
>   TDX architectural definitions                 KVM TDP refactoring for TDX
>         ^                                               ^
>         |                                               |
>    TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
>    coexistence          support
> 
> 
> The followings are explanations of each layer.  Each layer has a dummy commit
> that starts with [MARKER] in subject.  It is intended to help to identify where
> each layer starts.
> 
> TDX host kernel support:
>         https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
>         The guts of system-wide initialization of TDX module.  There is an
>         independent patch series for host x86.  TDX KVM patches call functions
>         this patch series provides to initialize the TDX module.
> 
> TDX, VMX coexistence:
>         Infrastructure to allow TDX to coexist with VMX and trigger the
>         initialization of the TDX module.
>         This layer starts with
>         "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> TDX architectural definitions:
>         Add TDX architectural definitions and helper functions
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> TD VM creation/destruction:
>         Guest TD creation/destroy allocation and releasing of TDX specific vm
>         and vcpu structure.  Create an initial guest memory image with TDX
>         measurement.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> TD vcpu creation/destruction:
>         guest TD creation/destroy Allocation and releasing of TDX specific vm
>         and vcpu structure.  Create an initial guest memory image with TDX
>         measurement.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> TDX EPT violation:
>         Create an initial guest memory image with TDX measurement.  Handle
>         secure EPT violations to populate guest pages with TDX SEAMCALLs.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> TD vcpu enter/exit:
>         Allow TDX vcpu to enter into TD and exit from TD.  Save CPU state before
>         entering into TD.  Restore CPU state after exiting from TD.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> TD vcpu interrupts/exit/hypercall:
>         Handle various exits/hypercalls and allow interrupts to be injected so
>         that TD vcpu can continue running.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
> 
> KVM MMU GPA shared bit:
>         Introduce a framework to handle the shared bit repurposed from a GPA
>         bit.  TDX repurposed one bit of the GPA to indicate whether the GPA is
>         shared or private.  If it's shared, it's the same as the conventional
>         VMX EPT case, and the VMM can access shared guest pages.  If it's
>         private, it's handled by Secure-EPT and the guest page is encrypted.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> KVM TDP refactoring for TDX:
>         TDX Secure EPT requires different constants, e.g. the initial EPT
>         entry value.  Various refactoring for those differences.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> KVM TDP MMU hooks:
>         Introduce a framework for the TDP MMU to add hooks in addition to
>         direct EPT access.  TDX added Secure EPT, which is an enhancement to
>         VMX EPT.  Unlike conventional VMX EPT, the CPU can't directly
>         read/write Secure EPT.  Instead, TDX SEAMCALLs are used to operate on
>         Secure EPT.
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> KVM TDP MMU MapGPA:
>         Introduce a framework to handle switching guest pages between private
>         and shared.  For a given GPA, a guest page can be assigned to either a
>         private GPA or a shared GPA exclusively.  With the TDX MapGPA
>         hypercall, a guest TD converts a GPA assignment from private (or
>         shared) to shared (or private).
>         This layer starts with
>         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
> 
> KVM guest private memory: (not shown in the above diagram)
> [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> memory: https://lkml.org/lkml/2022/1/18/395
>         Guest private memory requires different memory management in KVM.
>         The patch series proposes a way to do it; integration with TDX KVM is
>         ongoing.
> 
> (***)
> * TDX module
> A CPU-attested software module called the "TDX module" is designed to implement
> the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> loaded by the kernel or driver at runtime, but in this patch series we assume
> that the TDX module is already loaded and initialized.
> 
> The TDX module provides two main new logical modes of operation built upon the
> new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> architecture. TDX root mode is mostly identical to the VMX root operation mode,
> and the TDX functions (described later) are triggered by the new SEAMCALL
> instruction with the desired interface function selected by an input operand
> (leaf number, in RAX). TDX non-root mode is used for TD guest operation.  TDX
> non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> operation (i.e. guest VM), with changes and restrictions to better assure that
> no other software or hardware has direct visibility of the TD memory and state.
> 
> TDX transitions between TDX root operation and TDX non-root operation include TD
> Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> TDX root mode.  A TD Exit might be asynchronous, triggered by some external
> event (e.g., external interrupt or SMI) or an exception, or it might be
> synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
> 
> TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> Domain Host. Those host-side TDX interface functions are categorized into
> various areas just for better organization, such as SYS (TDX module management),
> MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
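> 
> (For reference, the low-level wrapper used to issue a SEAMCALL has roughly
> the following shape; the leaf number goes in RAX and register outputs come
> back in a struct.  Treat the exact prototype as an assumption of this
> sketch:)
> 
>   u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>                  struct tdx_module_output *out);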
> 
> TDCS (Trust Domain Control Structure) is the main control structure of a guest
> TD, and is encrypted using the guest TD's ephemeral private key.  At a high
> level, TDCS holds information for controlling TD operation as a whole:
> execution controls, EPTP, MSR bitmaps, etc., which KVM needs to set up.  Note that MSR
> bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> same value for all VCPUs of the same TD.
> 
> Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> TD VCPU.  It helps the TDX module control the operation of the VCPU, and holds
> the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> DMA access, accessible only by using the TDX module interface functions (such as
> TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> such as virtual APIC page, virtualization exception information, etc.
> 
> Several VMX control structures (such as Shared EPT and Posted interrupt
> descriptor) are directly managed and accessed by the host VMM.  These control
> structures are pointed to by fields in the TD VMCS.
> 
> The above means that 1) KVM needs to allocate different data structures for
> TDs, 2) KVM can reuse the existing code for TDs for some operations, and 3) it
> needs to define TD-specific handling for others, redirecting such operations
> to the TDX specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback()
> else vmx_callback();".
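> 
> A minimal sketch of that dispatch, reusing the illustrative callback names
> from the sentence above (not the exact patch code):
> 
>   static void vt_vcpu_op(struct kvm_vcpu *vcpu)
>   {
>           if (is_td_vcpu(vcpu))
>                   tdx_callback(vcpu);
>           else
>                   vmx_callback(vcpu);
>   }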
> 
> * TD Private Memory
> TD private memory is designed to hold TD private content, encrypted by the CPU
> using the TD ephemeral key. An encryption engine holds a table of encryption
> keys, and an encryption key is selected for each memory transaction based on a
> Host Key Identifier (HKID). By design, the host VMM does not have access to the
> encryption keys.
> 
> In the first generation of MKTME, HKID is "stolen" from the physical address by
> allocating a configurable number of bits from the top of the physical
> address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> HKID on the host so that MKTME can be opaque or bypassed on the host.
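> 
> For illustration, placing an HKID into the stolen top bits of a host physical
> address looks roughly like the following (the helper name and bit positions
> are examples; the real positions come from platform enumeration):
> 
>   #define HKID_SHIFT   46   /* example: 52-bit MAX_PA with 6 keyid bits */
> 
>   static inline u64 set_hkid_to_hpa(u64 hpa, u16 hkid)
>   {
>           return hpa | ((u64)hkid << HKID_SHIFT);
>   }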
> 
> During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> as either shared or private, based on the value of a new SHARED bit in the Guest
> Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
> (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> with the current VMX.  Since guest TDs usually require I/O and the data
> exchange needs to be done via shared memory, KVM needs to use the current EPT
> functionality even for TDs.
> 
> * Secure EPT and Mirroring using the TDP code
> The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
> pages are encrypted and integrity-protected with the TD's ephemeral private
> key.  Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> interface functions, and thus conceptually Secure EPT is a subset of EPT.
> Since execution of such interface functions takes a much longer time than
> accessing memory directly, in KVM we use the existing TDP code to mirror the
> Secure EPT for the TD.
> 
> This way, we can effectively walk Secure EPT without using the TDX interface
> functions.
> 
> * VM life cycle and TDX specific operations
> The userspace VMM, such as QEMU, needs to build and treat TDs differently.  For
> example, a TD needs to boot in private memory, and the host software cannot copy
> the initial image to private memory.
> 
> * TSC Virtualization
> The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> by TD configuration, i.e. when the TD is created, not per VCPU.  KVM currently
> owns TSC virtualization for VMs, but the TDX module does so for TDs.
> 
> * MCE support for TDs
> The TDX module doesn't allow the VMM to inject MCE.  Instead, a PV way is
> needed for the TD to communicate with the VMM.  For now, KVM silently ignores
> MCE requests from the VMM.  MSRs related to MCE (e.g., MCE bank registers) can
> be naturally emulated by paravirtualizing MSR access.
> 
> For details, the specifications [1], [2], [3], [4], [5], [6], and [7] are
> available.
> 
> * Restrictions or future work
> Some features are not included to reduce patch size.  Those features are
> addressed as future independent patch series.
> - large page (2M, 1G)
> - qemu gdb stub
> - guest PMU
> - and more
> 
> * Prerequisites
> It's required to load the TDX module and initialize it.  It's out of the scope
> of this patch series.  Another independent patch for the common x86 code is
> planned.  It defines CONFIG_INTEL_TDX_HOST and this patch series uses
> CONFIG_INTEL_TDX_HOST.  It's assumed that with CONFIG_INTEL_TDX_HOST=y, the
> TDX module is initialized, and the TDX module APIs for the TDX guest life
> cycle, like tdh.mng.init, are ready for KVM to use.
> 
> Concretely, global initialization, LP (Logical Processor) initialization,
> global configuration, the key configuration, and TDMR and PAMT initialization
> are done.  The state of the TDX module is SYS_READY.  Please refer to the TDX
> module specification, the chapter "Intel TDX Module Lifecycle State Machine".
> 
> ** Detecting the TDX module readiness.
> The TDX host patch series implements detection of TDX module availability
> and its initialization so that KVM can use it.  It also manages the Host
> KeyIDs (HKID) assigned to guest TDs.
> The APIs the TDX host patch series is assumed to provide are:
> - int seamrr_enabled()
>   Check if the required CPU feature (SEAM mode) is available.  This only
>   checks CPU feature availability; at this point, the TDX module may not be
>   ready for KVM to use.
> - int init_tdx(void);
>   Initialization of TDX module so that the TDX module is ready for KVM to use.
> - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
>   Return the system wide information about the TDX module.  NULL if the TDX
>   module isn't initialized.
> - u32 tdx_get_global_keyid(void);
>   Return the global key ID used by the TDX module itself.
> - int tdx_keyid_alloc(void);
>   Allocate HKID for guest TD.
> - void tdx_keyid_free(int keyid);
>   Free HKID for guest TD.
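> 
> For example, KVM-side setup could consume these APIs as in the following
> sketch (the function name and error codes are illustrative):
> 
>   static int __init tdx_module_setup(void)
>   {
>           const struct tdsysinfo_struct *sysinfo;
>           int ret;
> 
>           if (!seamrr_enabled())
>                   return -ENODEV;         /* no SEAM mode on this CPU */
> 
>           ret = init_tdx();               /* bring up the TDX module */
>           if (ret)
>                   return ret;
> 
>           sysinfo = tdx_get_sysinfo();    /* NULL if not initialized */
>           if (!sysinfo)
>                   return -EIO;
> 
>           return 0;
>   }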
> 
> (****)
> * TDX KVM high-level design
> - Host key ID management
> Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> It is assumed that the TDX host patch series implements the necessary
> functions: u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
> void tdx_keyid_free(int keyid).
> 
> - Data structures and VM type
> Because TDX is different from VMX, define its own VM/VCPU structures, struct
> kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx.  To
> identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> TDX, is used.
> 
> - VM life cycle and TDX specific operations
> Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> parameters, set initial guest memory and measurement.
> 
> The creation of a TDX VM requires five additional operations on top of the
> conventional VM creation:
>   - Get KVM system capability to check if TDX VM type is supported
>   - VM creation (KVM_CREATE_VM)
>   - New: Get the TDX specific system parameters.  KVM_TDX_GET_CAPABILITY.
>   - New: Set TDX specific VM parameters.  KVM_TDX_INIT_VM.
>   - VCPU creation (KVM_CREATE_VCPU)
>   - New: Set TDX specific VCPU parameters.  KVM_TDX_INIT_VCPU.
>   - New: Initialize guest memory as boot state and extend the measurement with
>     the memory.  KVM_TDX_INIT_MEM_REGION.
>   - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
>     TDX VM contents.
>   - VCPU RUN (KVM_VCPU_RUN)
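> 
> From the userspace side, the new commands are multiplexed through
> KVM_MEMORY_ENCRYPT_OP.  A sketch of the flow (struct kvm_tdx_cmd follows this
> series' uAPI with id/flags/data/error fields; payload structs are elided):
> 
>   struct kvm_tdx_cmd cmd = {
>           .id = KVM_TDX_INIT_VM,
>           .flags = 0,
>           .data = (__u64)(unsigned long)&init_vm,   /* VM parameters */
>   };
> 
>   ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);     /* VM-scope command */
>   /* ... KVM_CREATE_VCPU ... */
>   cmd.id = KVM_TDX_INIT_VCPU;
>   ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);   /* vCPU-scope command */
>   cmd.id = KVM_TDX_INIT_MEM_REGION;
>   ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
>   cmd.id = KVM_TDX_FINALIZE;
>   ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);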
> 
> - Protected guest state
> Because the guest state (CPU state and guest memory) is protected, the KVM
> VMM can't operate on it; for example, accessing CPU registers, injecting
> exceptions, or accessing guest memory.  Those operations are silently
> ignored, returning zero or the initial reset value when requested via KVM API
> ioctls.
> 
>     VM/VCPU state and callbacks for TDX specific operations.
>     Define tdx specific VM state and VCPU state instead of VMX ones.  Redirect
>     operations to TDX specific callbacks.  "if (tdx) tdx_op() else vmx_op()".
> 
>     Operations on the CPU state
>     Silently ignore operations on the guest state.  For example, a write to
>     CPU registers is ignored and a read from CPU registers returns 0.
> 
>     . ignore access to CPU registers except for allowed ones.
>     . TSC: add a check if the TSC is immutable and return an error, because
>       the KVM implementation updates the internal TSC state and it's
>       difficult to back out those changes.  Instead, skip the logic.
>     . dirty logging: add a check whether dirty logging is supported.
>     . exceptions/SMI/MCE/SIPI/INIT: silently ignore
> 
>     Note: virtual external interrupt and NMI can be injected into TDX guests.
> 
> - KVM MMU integration
> One bit of the guest physical address (bit 51 or 47) is repurposed to
> indicate if the guest physical address is private (the bit is cleared) or
> shared (the bit is set).  The bit is called a stolen bit.
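> 
> In code, the private/shared decision is a simple mask test.  A sketch using
> the helper names from this series (the field placement is illustrative):
> 
>   static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
>   {
>           return kvm->arch.gfn_shared_mask;   /* bit 51 or 47, as a gfn */
>   }
> 
>   static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
>   {
>           gfn_t mask = kvm_gfn_shared_mask(kvm);
> 
>           return mask && !(gpa_to_gfn(gpa) & mask);
>   }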
> 
>   - Stolen bits framework
>     Systematically tracks whether each guest physical address is used as
>     shared or private.
> 
>   - Shared EPT and secure EPT
>     There are two EPTs: Shared EPT (the conventional one) and Secure EPT
>     (the new one).  Shared EPT is used for GPAs with the stolen bit set and
>     is handled the same as conventional EPT.  Secure EPT points to private
>     guest pages.  To resolve an EPT violation, KVM walks one of the two EPTs
>     based on the faulted GPA.  Because it's costly to access Secure EPT with
>     SEAMCALLs while walking EPTs for private guest physical addresses,
>     another private EPT is used as a mirror of the Secure-EPT, reusing the
>     existing logic at the cost of extra memory.
> 
> The following depicts the relationship.
> 
>                     KVM                             |       TDX module
>                      |                              |           |
>         -------------+----------                    |           |
>         |                      |                    |           |
>         V                      V                    |           |
>      shared GPA           private GPA               |           |
>   CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
>         |                      |                    |           |
>         |                      |                    |           |
>         V                      V                    |           V
>   shared EPT                private EPT--------mirror----->Secure EPT
>         |                      |                    |           |
>         |                      \--------------------+------\    |
>         |                                           |      |    |
>         V                                           |      V    V
>   shared guest page                                 |    private guest page
>                                                     |
>                                                     |
>                               non-encrypted memory  |    encrypted memory
>                                                     |
> 
>   - Operating on Secure EPT
>     Use the TDX module APIs to operate on Secure EPT.  To call the TDX APIs
>     while resolving an EPT violation, add hooks for the additional operations
>     and wire them to the TDX backend.
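> 
>     As a sketch, a hook that installs a leaf SPTE for a private GPA ends up
>     issuing the AUG SEAMCALL (the helper names approximate this series'
>     tdx_ops wrappers):
> 
>       static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>                                            enum pg_level level, kvm_pfn_t pfn)
>       {
>               struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>               struct tdx_module_output out;
>               u64 err;
> 
>               /* Ask the TDX module to add the page to the Secure EPT. */
>               err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
>                                      pfn_to_hpa(pfn), &out);
>               return err ? -EIO : 0;
>       }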
> 
> * References
> 
> [1] TDX specification
>    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> [2] Intel Trust Domain Extensions (Intel TDX)
>    https://cdrdv2.intel.com/v1/dl/getContent/726790
> [3] Intel CPU Architectural Extensions Specification
>    https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> [4] Intel TDX Module 1.0 Specification
>    https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> [5] Intel TDX Loader Interface Specification
>   https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> [6] Intel TDX Guest-Hypervisor Communication Interface
>    https://cdrdv2.intel.com/v1/dl/getContent/726790
> [7] Intel TDX Virtual Firmware Design Guide
>    https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> [8] intel public github
>    kvm TDX branch: https://github.com/intel/tdx/tree/kvm
>    TDX guest branch: https://github.com/intel/tdx/tree/guest
>    qemu TDX https://github.com/intel/qemu-tdx
> [9] TDVF
>     https://github.com/tianocore/edk2-staging/tree/TDVF
>     This was merged into EDK2 main branch. https://github.com/tianocore/edk2
> 
> Chao Gao (3):
>   KVM: x86: Move check_processor_compatibility from init ops to runtime
>     ops
>   Partially revert "KVM: Pass kvm_init()'s opaque param to additional
>     arch funcs"
>   KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
>     wrmsr
> 
> Isaku Yamahata (72):
>   KVM: Refactor CPU compatibility check on module initialiization
>   x86/virt/vmx/tdx: export platform_tdx_enabled()
>   KVM: TDX: Detect CPU feature on kernel module initialization
>   KVM: x86: Refactor KVM VMX module init/exit functions
>   KVM: TDX: Add placeholders for TDX VM/vcpu structure
>   x86/virt/tdx: Add a helper function to return system wide info about
>     TDX module
>   KVM: TDX: Initialize TDX module when loading kvm_intel.ko
>   KVM: TDX: Make TDX VM type supported
>   [MARKER] The start of TDX KVM patch series: TDX architectural
>     definitions
>   KVM: TDX: Define TDX architectural definitions
>   KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
>   KVM: TDX: Add helper functions to print TDX SEAMCALL error
>   [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
>   x86/cpu: Add helper functions to allocate/free TDX private host key id
>   KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
>   KVM: TDX: Make pmu_intel.c ignore guest TD case
>   [MARKER] The start of TDX KVM patch series: TD vcpu
>     creation/destruction
>   KVM: TDX: allocate/free TDX vcpu structure
>   KVM: TDX: allocate/free TDX vcpu structure
>   [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
>   KVM: x86/mmu: introduce config for PRIVATE KVM MMU
>   [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
>     TDX
>   KVM: x86/mmu: Disallow fast page fault on private GPA
>   KVM: VMX: Introduce test mode related to EPT violation VE
>   [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
>   KVM: x86/mmu: Focibly use TDP MMU for TDX
>   KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
>   KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
>   KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
>   [MARKER] The start of TDX KVM patch series: TDX EPT violation
>   KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
>   KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
>   KVM: TDX: TDP MMU TDX support
>   [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
>   KVM: x86/mmu: steal software usable git to record if GFN is for shared
>     or not
>   KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
>   [MARKER] The start of TDX KVM patch series: TD finalization
>   KVM: TDX: Create initial guest memory
>   KVM: TDX: Finalize VM initialization
>   [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
>   KVM: TDX: Add helper assembly function to TDX vcpu
>   KVM: TDX: Implement TDX vcpu enter/exit path
>   KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
>   KVM: TDX: restore host xsave state when exit from the guest TD
>   KVM: TDX: restore user ret MSRs
>   [MARKER] The start of TDX KVM patch series: TD vcpu
>     exits/interrupts/hypercalls
>   KVM: TDX: complete interrupts after tdexit
>   KVM: TDX: restore debug store when TD exit
>   KVM: TDX: handle vcpu migration over logical processor
>   KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
>     behavior
>   KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
>   KVM: TDX: Implement interrupt injection
>   KVM: TDX: Implements vcpu request_immediate_exit
>   KVM: TDX: Implement methods to inject NMI
>   KVM: TDX: Add a place holder to handle TDX VM exit
>   KVM: TDX: handle EXIT_REASON_OTHER_SMI
>   KVM: TDX: handle ept violation/misconfig exit
>   KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
>   KVM: TDX: Add a place holder for handler of TDX hypercalls
>     (TDG.VP.VMCALL)
>   KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
>   KVM: TDX: Handle TDX PV CPUID hypercall
>   KVM: TDX: Handle TDX PV HLT hypercall
>   KVM: TDX: Handle TDX PV port io hypercall
>   KVM: TDX: Implement callbacks for MSR operations for TDX
>   KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
>   KVM: TDX: Handle TDX PV report fatal error hypercall
>   KVM: TDX: Handle TDX PV map_gpa hypercall
>   KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
>   KVM: TDX: Silently discard SMI request
>   KVM: TDX: Silently ignore INIT/SIPI
>   Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
>   KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> 
> Rick Edgecombe (1):
>   KVM: x86/mmu: Add address conversion functions for TDX shared bits
> 
> Sean Christopherson (25):
>   KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
>   KVM: Enable hardware before doing arch VM initialization
>   KVM: x86: Introduce vm_type to differentiate default VMs from
>     confidential VMs
>   KVM: TDX: Add TDX "architectural" error codes
>   KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
>   KVM: TDX: create/destroy VM structure
>   KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
>   KVM: TDX: Do TDX specific vcpu initialization
>   KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
>   KVM: x86/mmu: Allow non-zero value for non-present SPTE
>   KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
>   KVM: x86/mmu: Allow per-VM override of the TDP max page level
>   KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
>     private mmu
>   KVM: x86/mmu: Disallow dirty logging for x86 TDX
>   KVM: VMX: Split out guts of EPT violation to common/exposed function
>   KVM: VMX: Move setting of EPT MMU masks to common VT-x code
>   KVM: TDX: Add load_mmu_pgd method for TDX
>   KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
>   KVM: TDX: Add support for find pending IRQ in a protected local APIC
>   KVM: x86: Assume timer IRQ was injected if APIC state is proteced
>   KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
>     argument
>   KVM: VMX: Move NMI/exception handler to common helper
>   KVM: x86: Split core of hypercall emulation to helper function
>   KVM: TDX: Handle TDX PV MMIO hypercall
>   KVM: TDX: Add methods to ignore accesses to CPU state
> 
> Xiaoyao Li (1):
>   KVM: TDX: initialize VM with TDX specific parameters
> 
>  Documentation/virt/kvm/api.rst                |   30 +-
>  .../virt/kvm/intel-tdx-layer-status.rst       |   33 +
>  Documentation/virt/kvm/intel-tdx.rst          |  381 +++
>  Documentation/virt/kvm/tdx-tdp-mmu.rst        |  466 ++++
>  arch/arm64/kvm/arm.c                          |    2 +-
>  arch/mips/kvm/mips.c                          |   14 +-
>  arch/powerpc/kvm/powerpc.c                    |    2 +-
>  arch/riscv/kvm/main.c                         |    2 +-
>  arch/s390/kvm/kvm-s390.c                      |    2 +-
>  arch/x86/events/intel/ds.c                    |    1 +
>  arch/x86/include/asm/kvm-x86-ops.h            |   10 +
>  arch/x86/include/asm/kvm_host.h               |   56 +-
>  arch/x86/include/asm/tdx.h                    |   67 +
>  arch/x86/include/asm/vmx.h                    |   14 +
>  arch/x86/include/uapi/asm/kvm.h               |   95 +
>  arch/x86/include/uapi/asm/vmx.h               |    5 +-
>  arch/x86/kvm/Kconfig                          |    4 +
>  arch/x86/kvm/Makefile                         |    3 +-
>  arch/x86/kvm/irq.c                            |    3 +
>  arch/x86/kvm/lapic.c                          |   37 +-
>  arch/x86/kvm/lapic.h                          |    2 +
>  arch/x86/kvm/mmu.h                            |   42 +-
>  arch/x86/kvm/mmu/mmu.c                        |  360 ++-
>  arch/x86/kvm/mmu/mmu_internal.h               |  123 +-
>  arch/x86/kvm/mmu/paging_tmpl.h                |    5 +-
>  arch/x86/kvm/mmu/spte.c                       |   46 +-
>  arch/x86/kvm/mmu/spte.h                       |   65 +-
>  arch/x86/kvm/mmu/tdp_iter.c                   |    1 +
>  arch/x86/kvm/mmu/tdp_iter.h                   |    5 +-
>  arch/x86/kvm/mmu/tdp_mmu.c                    |  690 ++++-
>  arch/x86/kvm/mmu/tdp_mmu.h                    |   12 +-
>  arch/x86/kvm/svm/svm.c                        |   13 +-
>  arch/x86/kvm/vmx/common.h                     |  174 ++
>  arch/x86/kvm/vmx/evmcs.c                      |    2 +-
>  arch/x86/kvm/vmx/evmcs.h                      |    2 +-
>  arch/x86/kvm/vmx/main.c                       | 1071 +++++++
>  arch/x86/kvm/vmx/pmu_intel.c                  |   39 +-
>  arch/x86/kvm/vmx/pmu_intel.h                  |   28 +
>  arch/x86/kvm/vmx/posted_intr.c                |   43 +-
>  arch/x86/kvm/vmx/posted_intr.h                |   13 +
>  arch/x86/kvm/vmx/tdx.c                        | 2465 +++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h                        |  275 ++
>  arch/x86/kvm/vmx/tdx_arch.h                   |  157 ++
>  arch/x86/kvm/vmx/tdx_errno.h                  |   29 +
>  arch/x86/kvm/vmx/tdx_error.c                  |   22 +
>  arch/x86/kvm/vmx/tdx_ops.h                    |  188 ++
>  arch/x86/kvm/vmx/vmenter.S                    |  146 +
>  arch/x86/kvm/vmx/vmx.c                        |  737 ++---
>  arch/x86/kvm/vmx/vmx.h                        |   39 +-
>  arch/x86/kvm/vmx/x86_ops.h                    |  235 ++
>  arch/x86/kvm/x86.c                            |  148 +-
>  arch/x86/virt/vmx/tdx/seamcall.S              |    2 +
>  arch/x86/virt/vmx/tdx/tdx.c                   |   54 +-
>  arch/x86/virt/vmx/tdx/tdx.h                   |   52 -
>  include/linux/kvm_host.h                      |    4 +-
>  include/uapi/linux/kvm.h                      |    2 +
>  tools/arch/x86/include/uapi/asm/kvm.h         |   95 +
>  tools/include/uapi/linux/kvm.h                |    1 +
>  virt/kvm/kvm_main.c                           |   67 +-
>  59 files changed, 7877 insertions(+), 804 deletions(-)
>  create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
>  create mode 100644 Documentation/virt/kvm/intel-tdx.rst
>  create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
>  create mode 100644 arch/x86/kvm/vmx/common.h
>  create mode 100644 arch/x86/kvm/vmx/main.c
>  create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
>  create mode 100644 arch/x86/kvm/vmx/tdx.c
>  create mode 100644 arch/x86/kvm/vmx/tdx.h
>  create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
>  create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
>  create mode 100644 arch/x86/kvm/vmx/tdx_error.c
>  create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
>  create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> 
> -- 
> 2.25.1
>
Chao Gao July 12, 2022, 5:07 a.m. UTC | #2
On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
>Hi. Because my description on large page support was terse, I wrote up more
>detailed one.  Any feedback/thoughts on large page support?
>
>TDP MMU large page support design
>
>Two main discussion points
>* how to track page status. private vs shared, no-largepage vs can-be-largepage

...

>
>Tracking private/shared and large page mappable
>-----------------------------------------------
>VMM needs to track whether a page is mapped as private or shared at 4KB
>granularity.  For efficiency of the EPT violation path (****), at the 2MB and
>1GB levels, VMM should track whether the page can be mapped as a large page
>(regarding private/shared).  VMM updates it on MapGPA and references it on
>the EPT violation path. (****)

Isaku,

+ Peng Chao

Doesn't UPM guarantee that a 2MB/1GB large page in CR3 should be either all
private or all shared?

KVM always retrieves the mapping level in CR3 and enforces that the EPT's page
level is not greater than that in CR3.  My point is that if UPM already
enforces no mixed pages within a large page, then KVM needn't do that again
(UPM can be trusted).

Maybe I am misunderstanding something?
Chao Peng July 12, 2022, 10:49 a.m. UTC | #3
On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> Hi. Because my description on large page support was terse, I wrote up more
> detailed one.  Any feedback/thoughts on large page support?
> 
> TDP MMU large page support design
> 
> Two main discussion points
> * how to track page status. private vs shared, no-largepage vs can-be-largepage
> * how to trigger merging mapping from 4KB/2MB to 2MB/1GB
> 
> Expected private-vs-shared page usage
> -------------------------------------
> On TD boot all pages are private and TD converts pages into shared if necessary.
> * Most of the guest pages remain private.
> * Only limited pages are converted at kernel boot
>   ** bounce buffer for IO (virt-io).  It's allocated as swiotlb.  Its size is
>      64MB or 6% of total guest memory.
>   ** KVM PV shared page. (the current guest TD doesn't use KVM PV shared page.)
> * Only a small number of pages are dynamically converted from private to shared
>   and vice versa.  This usage is very limited. e.g. GetQuote, the lack of
>   swiotlb buffer
> 
> 
> Theory of Secure-EPT operations related to large page
> -----------------------------------------------------
> TDX Secure-EPT has differences from VMX EPT.
> To add a page to Secure-EPT:
> 
> * Here is the operation to resolve the EPT violation.
> 1. TD: Accepts GPA.  TD needs to accept GPA before accessing GPA because TD
>    needs to detect that VMM unmaps GPA and maps GPA again.
> 2. EPT violation is triggered.  TD exit to VMM.
> 3. VMM: allocate a page for GPA and TDH.MEM.PAGE.AUG it to GPA.  Resume TD vcpu.
>    (3a. TD: #VE<EPT violation> is injected.  #VE handler accepts the page)
> 4. TD: resume #VE and continue TD vcpu execution
> 
> The TD may omit step 1.  In that case, after step 3, #VE is injected into the
> TD and the TD #VE handler needs to accept the page.
> 
> When adding a page to Secure-EPT again, the page contents are cleared and the
> page is encrypted.  If a page is disassociated from Secure-EPT and added again,
> the page content is lost.
> 
> * TDG.VP.VMCALL<MapGPA> hypercall
> The page associated with GPA can be private or shared.  TD converts the GPA by
> TDG.VP.VMCALL<MapGPA> hypercall from private to shared or vice versa.  VMM
> tracks whether the given GPA is private or shared.
> 
> * mapping merge(promote)/split(demote)
> The page can be mapped as large page (2MB or 1GB) in addition to 4KB.  The
> mapping can be merged(4KB/2MB -> 2MB/1GB) or split(2MB/1GB -> 4KB/2MB) by TDX
> SEAMCALL TDH.MEM.PAGE.PROMOTE and TDH.MEM.PAGE.DEMOTE.
> Unlike VMX EPT, merging a mapping requires that all the constituent pages are
> already mapped, because of encryption.  This implies the current KVM
> implementation doesn't work for TDX when merging mappings, as follows:
> 
> - EPT violation and host page is 2MB mappable.
>   some of the 4KB pages of the given 2MB page are already mapped, some not.
>   i.e. 2MB EPT -> 4KB EPT -> 4K pages
> - KVM page fault handler zap 2MB EPT entry and populate 2MB EPT entry
>   zap: 2MB EPT: non present
>   populate 2MB: -> 2MB page
> 
> If the VMM zaps a 2MB Secure-EPT entry, the page contents will be lost for
> TDX, while mapping merge requires that all pages are already mapped.
> 
> Instead, the following steps are needed.
> - EPT violation and host page is 2MB mappable.
>   some of the 4KB pages of the given 2MB page are already mapped.  Some not.
>   i.e. 2MB EPT -> 4KB EPT -> 4K pages
> - VMM checks all 4KB GPAs are private. If not, it can't be mapped as a large page.
>   (****)
> - VMM checks all 4KB GPAs are already mapped.  If not, give up mapping merge.
>   (or map missing 4KB pages.)
> - mapping merge by TDH.MEM.PAGE.PROMOTE
> 
> The mapping split for TDX Secure-EPT works similarly to the VMX EPT case.
> 
> 
> EPT violation and MapGPA
> ------------------------
> - EPT violation is a fast path
> - MapGPA is not a fast path.
> => Keep the EPT violation path optimized and complicate the MapGPA path
> instead.  For the (****) check, we don't want to scan the 4KB mappings on EPT
> violation.  Instead, the MapGPA path scans them and records whether the page
> can be mapped as 2MB with respect to private/shared.

This sounds reasonable.  Instead of tracking that in MapGPA, maybe
KVM_MEMORY_ENCRYPT_{UN,}REG_REGION introduced in UPM v7 is a better
place to put the scan code in.

  https://lkml.org/lkml/2022/7/6/259

Both the MapGPA (explicit conversion) and the EPT violation (implicit
conversion) can cause invocation of these two ioctls and need to update
this info.

> 
> 
> Tracking private/shared and large page mappable
> -----------------------------------------------
> VMM needs to track whether a page is mapped as private or shared at 4KB
> granularity.  For efficiency of the EPT violation path (****), at the 2MB and
> 1GB levels, VMM should track whether the page can be mapped as a large page
> (regarding private/shared).  VMM updates it on MapGPA and references it on
> the EPT violation path. (****)
> 
> For 4KB pages, 1 bit is needed: private or shared.  Let's call it the
> shared-mask bit.  For 2MB/1GB pages, 2 bits are needed: large page mappable
> or not, and private or shared if mappable.  Let's call the first the
> no-largepage bit.
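> 
> A compact way to picture this bookkeeping (a plain sketch, not patch code):
> 
>   struct gpa_track {
>           unsigned long *shared_4k;   /* shared-mask: 1 bit per 4KB page */
>           unsigned long *shared_2m;   /* shared-mask per 2MB page */
>           unsigned long *nolarge_2m;  /* no-largepage bit per 2MB page */
>   };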

I'm just thinking maybe we don't need to introduce new bits; instead we can
reuse lpage_info, where we already track whether a page can be mapped at a
specified page level in kvm_mmu_max_mapping_level().  Then in the above two
ioctls we do a scan for each level and update lpage_info.  For example, we
should set disallow_lpage if private/shared pages are mixed at that page
level.

It's however a bit tricky to manage lpage_info.disallow_lpage in these
two ioctls with the current code.  We can't simply do disallow_lpage++ and
disallow_lpage--.  One possible solution is to treat disallow_lpage as a
mask instead of a count.  Then we define bits like below for use:
  - USER_GFN_UNALIGNED set when the memslot user_address/private_offset/gfn
    is not aligned on the page level
  - PAGE_TRACKING set during page tracking
  - PRIVATE_SHARED_MIXED set when private/shared pages are mixed

In the page fault handler the page can be mapped at that level only when all
bits are zero, and in the above two ioctls we just switch on/off the bit
PRIVATE_SHARED_MIXED.
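
A sketch of that flag-mask idea (the bit names are as proposed above; this is
not existing KVM/UPM code):

  #define KVM_LPAGE_USER_GFN_UNALIGNED    BIT(0)
  #define KVM_LPAGE_PAGE_TRACKING         BIT(1)
  #define KVM_LPAGE_PRIVATE_SHARED_MIXED  BIT(2)

  /* A large page is allowed only when no bit is set. */
  static bool lpage_allowed(const struct kvm_lpage_info *linfo)
  {
          return !linfo->disallow_lpage;
  }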

Currently UPM doesn't have this code yet, but it can be added if feasible.

Chao
> 
> Option A.)
>   Allocate array for pages in struct kvm_arch_memory_slot on TD creation.
>   struct kvm_arch_memory_slot {
>     +struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
>   }
> 
>   pros:
>   +straightforward implementation
>   +SPTE_SHARED_MASK is not needed
>   cons:
>   -memory overhead is high
>   -not optimized for expected usage
>   -one more look-up on EPT violation
> 
> Option B.) Steal two software usable bits from SPTE and record them in SPTE.
>            SPTE_SHARED_MASK, SPTE_NOLARGE_PAGE_MASK
>   pros:
>   +optimized for EPT violation
>   cons:
>   -2bits used in SPTE entry
>   -complicates the MapGPA path.
> 
> Option C.) Steal one software usable bit from SPTE and record it in SPTE.
>            SPTE_SHARED_MASK
>            For 2MB/1GB, allocate bitmap in kvm_mmu_page.
>            struct kvm_mmu_page {
>              bitmap nolarge
>            }
>   pros:
>   +optimized for EPT violation
>   cons:
>   -complicates the MapGPA path.
>   -information is scattered in SPTE and struct kvm_mmu_page
> 
> 
> How to update those bits
> ------------------------
> - MapGPA
>   - at 4KB level, set or clear shared-mask bit.
>   - Scan 512 4KB bit, at 2MB level
>     - set or clear shared-mask bit, clear no-largepage bit or
>     - clear shared-mask bit, set no-largepage bit
>     - increment/decrement lpage_info to prevent/allow large pages
>   - similar for 1GB level
>   Note: This logic might be a bit tricky.
> 
> - EPT violation
>   - If 2MB large page is allowed, check if no-largepage bit
>     - If no-largepage bit is set, => go down to 4KB page
>     - If no-largepage bit is cleared => try to map 2MB page
>       - If 4KB level is not mapped, map 2MB page
>       - If some 4KB level is already mapped, go down to 4KB.
>         Don't try to merge the mapping (or, optionally, try to merge it).
>   Note: 512 4KB entry scanning is not done at EPT violation because it's fast
>         path.
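> 
> In code, the fast-path decision could look like this sketch (the mask name
> follows Option B above; not patch code):
> 
>   static int private_fault_max_level(u64 spte_2m, int host_level)
>   {
>           /* Only the 2MB-level entry is consulted; the 512 4KB entries
>            * are never scanned on this path. */
>           if (host_level < PG_LEVEL_2M ||
>               (spte_2m & SPTE_NOLARGE_PAGE_MASK))
>                   return PG_LEVEL_4K;
>           return PG_LEVEL_2M;
>   }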
> 
> 
> Map merging
> -----------
> Map merging is necessary for TD migration. (Map split is the easy part.)  The
> current KVM implementation zaps the range (mmu notification or lpage recovery
> worker) and expects large page mapping on the next EPT violation.
> 
> Option A.) Keep the code similar to map merging logic.
> Zap the 2MB EPT entry in some sense and trigger the map merging logic on the
> next EPT violation.  To keep the encrypted page contents, zapped EPT entries
> need to keep the page.  Steal one more bit from the SPTE:
> SPTE_PRIVATE_BLOCKED_MASK.  It means that the page is zapped from the SPTE,
> but it is still alive and references the page.
> 
> Option B.) In the callback, directly merge mapping somehow.  In this case, mmu
> notifier usage doesn't make sense.
> 
> NOTE:
> - Implement map merging in MapGPA. This doesn't work for dirty page logging.
> - We can utilize kvm_nx_lpage_recovery_worker
> - We can utilize THP. Probably doesn't work well for fd-based private memory.
> 
> Thanks,
> Isaku Yamahata
> 
> On Mon, Jun 27, 2022 at 02:52:52PM -0700,
> isaku.yamahata@intel.com wrote:
> 
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > KVM TDX basic feature support
> > 
> > Hello.  This is v7 the patch series vof KVM TDX support.
> > This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> > The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> > How to run/test: It's describe at https://github.com/intel/tdx/wiki/TDX-KVM
> > 
> > Major changes from v6:
> > - rebased to v5.19 base
> > 
> > TODO:
> > - integrate fd-based guest memory. As the discussion is still on-going, I
> >   intentionally dropped fd-based guest memory support yet.  The integration can
> >   be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> > - 2M large page support. It's work-in-progress.
> > For large page support, there are several design choices. Here is the design options.
> > Any thoughts/feedback?
> > 
> > KVM MMU Large page support for TDX
> > 
> > * What needs to be done
> > - Track private or shared of each page size (4KB, 2MB, 1GB) based on
> >   TDG.VP.VMCALL<MapGPA>.  For large pages(2MB, 1GB), it can be mixed (some
> >   lower-size pages are private and some shared.)  In this case, the page can't
> >   be large.
> > - if necessary, split large page on TDG.VP.VMCALL<MapGPA>
> >   (split on dirty page tracking is future work)
> > - resolving KVM page fault
> >   When resolving a private page and the page is large in the host, GPA can be
> >   resolved as a large page in Secure-EPT.  Even if the page is large on the host
> >   side, sometimes a 4KB page can be resolved because it's up to guest TD to
> >   accept at 4KB, 2MB, or 1GB.
> > - collapsing pages into a large page.
> >   At this point, it's okay to not implement this.  When dirty page tracking is
> >   supported, this needs to be supported.
> >   - On MapGPA, the page can be collapsed into a large page
> >   - handle zapping SPTE and try to collapse the pages on the next KVM page fault
> >     Unlike the EPT case, some trick is needed.
> > - For performance, optimize KVM page fault path at the cost of complicating
> >   MapGPA path.
> > 
> > * options to track private or shared
> > At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> > 1GB case). For 4KB each page, 1 bit per page is needed. private or shared.  For
> > large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
> > mixed).  When resolving KVM page fault, we don't want to check the lower-size
> > pages to check if the given GPA can be a large for performance.  On MapGPA check
> > it instead.
> > 
> > Option A). enhance kvm_arch_memory_slot
> >   enum kvm_page_type {
> >        KVM_PAGE_TYPE_INVALID,
> >        KVM_PAGE_TYPE_SHARED,
> >        KVM_PAGE_TYPE_PRIVATE,
> >        KVM_PAGE_TYPE_MIXED,
> >   };
> > 
> >   struct kvm_page_attr {
> >        enum kvm_page_type type;
> >   };
> > 
> >  struct kvm_arch_memory_slot {
> >  +      struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> > 
> > Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> > If !SPTE_MIXED_MASK, it can be large page.
> > 
> > Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> > kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
> > 
> > 
> > * comparison
> > A).
> > + straightforward to implement
> > + SPTE_SHARED_MASK isn't needed
> > - memory overhead compared to B). or C).
> > - more memory reference on KVM page fault
> > 
> > B).
> > + simpler than C) (complex than A)?)
> > + efficient on KVM page fault. (only SPTE reference)
> > + low memory overhead
> > - Waste precious SPTE bits.
> > 
> > C).
> > + efficient on KVM page fault. (only SPTE reference)
> > + low memory overhead
> > - complicates MapGPA
> > - scattered data structure
> > 
> > Thanks,
> > Isaku Yamahata
> > 
> > Changes from v6:
> > - rebased to v5.19
> > 
> > Changes from v5:
> > - export __seamcall and use it
> > - move mutex lock from callee function of smp_call_on_cpu to the caller.
> > - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> > - updated comment
> > - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
> >   compatibility to only return success
> > - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> > - make this ioctl systemwide ioctl
> > - ABI change to struct kvm_init_vm
> > - guest_tsc_khz: use kvm->arch.default_tsc_khz
> > - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> > - drop exporting kvm_set_tsc_khz().
> > - fix kvm_tdp_page_fault() for mtrr emulation
> > - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> > - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
> >   keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> > - update commit message
> > - rename shadow_init_value => shadow_nonprsent_value
> > - added ept_violation_ve_test mode
> > - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> > - legacy MMU case
> >   => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
> >      - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> > - #VE warning:
> > - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> > - merge into Like we discussed, this patch should be merged with patch
> >   "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> > - fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
> > - introduce kvm_gfn_for_root(kvm, root, gfn)
> > - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> > - use kvm_arch_dirty_log_supported()
> > - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> > - rename: is_private_prohibit_spte() => spte_shared_mask()
> > - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> > - dropped this patch as the change was merged into kvm/queue
> > - update vt_apicv_post_state_restore()
> > - use is_64_bit_hypercall()
> > - comment: expand MSMI -> Machine Check System Management Interrupt
> > - fixed TDX_SEPT_PFERR
> > - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> > - rename tdmvcall_exit_readon() => tdvmcall_leaf()
> > - remove optional zero check of argument.
> > - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
> >    in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> > - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> > - introduce vcpu_deliver_init to x86_ops
> > - sprinkeled KVM_BUG_ON()
> > 
> > Changes from v4:
> > - rebased to TDX host kernel patch series.
> > - include all the patches to make this patch series working.
> > - add [MARKER] patches to mark the patch layer clear.
> > 
> > ---
> > * What's TDX?
> > TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
> > Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> > Domain (TD) for confidential computing.
> > 
> > A TD runs in a CPU mode that is designed to protect the confidentiality of its
> > memory contents and its CPU state from any other software, including the hosting
> > Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
> > 
> > We have more detailed explanations below (***).
> > We have the high-level design of TDX KVM below (****).
> > 
> > In this patch series, we use "TD" or "guest TD" to differentiate it from the
> > current "VM" (Virtual Machine), which is supported by KVM today.
> > 
> > 
> > * The organization of this patch series
> > This patch series is on top of the patches series "TDX host kernel support":
> > https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> > 
> > this patch series is available at
> > https://github.com/intel/tdx/releases/tag/kvm-upstream
> > The corresponding patches to qemu are available at
> > https://github.com/intel/qemu-tdx/commits/tdx-upstream
> > 
> > The relations of the layers are depicted as follows.
> > The arrows below show the order of patch reviews we would like to have.
> > 
> > The below layers are chosen so that the device model, for example, qemu can
> > exercise each layering step by step.  Check if TDX is supported, create TD VM,
> > create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> > vcpu exits/hypercalls/interrupts to run TD fully.
> > 
> >   TDX vcpu
> >   interrupt/exits/hypercall<------------\
> >         ^                               |
> >         |                               |
> >   TD finalization                       |
> >         ^                               |
> >         |                               |
> >   TDX EPT violation<------------\       |
> >         ^                       |       |
> >         |                       |       |
> >   TD vcpu enter/exit            |       |
> >         ^                       |       |
> >         |                       |       |
> >   TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
> >         ^                       |                       ^
> >         |                       |                       |
> >   TD VM creation/destruction    \---------------KVM TDP MMU hooks
> >         ^                                               ^
> >         |                                               |
> >   TDX architectural definitions                 KVM TDP refactoring for TDX
> >         ^                                               ^
> >         |                                               |
> >    TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
> >    coexistence          support
> > 
> > 
> > The followings are explanations of each layer.  Each layer has a dummy commit
> > that starts with [MARKER] in subject.  It is intended to help to identify where
> > each layer starts.
> > 
> > TDX host kernel support:
> >         https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> >         The guts of system-wide initialization of TDX module.  There is an
> >         independent patch series for host x86.  TDX KVM patches call functions
> >         this patch series provides to initialize the TDX module.
> > 
> > TDX, VMX coexistence:
> >         Infrastructure to allow TDX to coexist with VMX and trigger the
> >         initialization of the TDX module.
> >         This layer starts with
> >         "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> > TDX architectural definitions:
> >         Add TDX architectural definitions and helper functions
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> > TD VM creation/destruction:
> >         Guest TD creation/destroy allocation and releasing of TDX specific vm
> >         and vcpu structure.  Create an initial guest memory image with TDX
> >         measurement.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> > TD vcpu creation/destruction:
> >         guest TD creation/destroy Allocation and releasing of TDX specific vm
> >         and vcpu structure.  Create an initial guest memory image with TDX
> >         measurement.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> > TDX EPT violation:
> >         Create an initial guest memory image with TDX measurement.  Handle
> >         secure EPT violations to populate guest pages with TDX SEAMCALLs.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> > TD vcpu enter/exit:
> >         Allow TDX vcpu to enter into TD and exit from TD.  Save CPU state before
> >         entering into TD.  Restore CPU state after exiting from TD.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> > TD vcpu interrupts/exit/hypercall:
> >         Handle various exits/hypercalls and allow interrupts to be injected so
> >         that TD vcpu can continue running.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
> > 
> > KVM MMU GPA shared bit:
> >         Introduce framework to handle shared bit repurposed bit of GPA TDX
> >         repurposed a bit of GPA to indicate shared or private. If it's shared,
> >         it's the same as the conventional VMX EPT case.  VMM can access shared
> >         guest pages.  If it's private, it's handled by Secure-EPT and the guest
> >         page is encrypted.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> > KVM TDP refactoring for TDX:
> >         TDX Secure EPT requires different constants. e.g. initial value EPT
> >         entry value etc. Various refactoring for those differences.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> > KVM TDP MMU hooks:
> >         Introduce framework to TDP MMU to add hooks in addition to direct EPT
> >         access TDX added Secure EPT which is an enhancement to VMX EPT.  Unlike
> >         conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> >         use TDX SEAMCALLs to operate on Secure EPT.
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> > KVM TDP MMU MapGPA:
> >         Introduce framework to handle switching guest pages from private/shared
> >         to shared/private.  For a given GPA, a guest page can be assigned to a
> >         private GPA or a shared GPA exclusively.  With TDX MapGPA hypercall,
> >         guest TD converts GPA assignments from private (or shared) to shared (or
> >         private).
> >         This layer starts with
> >         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
> > 
> > KVM guest private memory: (not shown in the above diagram)
> > [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> > memory: https://lkml.org/lkml/2022/1/18/395
> >         Guest private memory requires different memory management in KVM.  The
> >         patch proposes a way for it.  Integration with TDX KVM.
> > 
> > (***)
> > * TDX module
> > A CPU-attested software module called the "TDX module" is designed to implement
> > the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> > loaded by the kernel or driver at runtime, but in this patch series we assume
> > that the TDX module is already loaded and initialized.
> > 
> > The TDX module provides two main new logical modes of operation built upon the
> > new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> > architecture. TDX root mode is mostly identical to the VMX root operation mode,
> > and the TDX functions (described later) are triggered by the new SEAMCALL
> > instruction with the desired interface function selected by an input operand
> > (leaf number, in RAX). TDX non-root mode is used for TD guest operation.  TDX
> > non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> > operation (i.e. guest VM), with changes and restrictions to better assure that
> > no other software or hardware has direct visibility of the TD memory and state.
> > 
> > TDX transitions between TDX root operation and TDX non-root operation include TD
> > Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> > TDX root mode.  A TD Exit might be asynchronous, triggered by some external
> > event (e.g., external interrupt or SMI) or an exception, or it might be
> > synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
> > 
> > TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> > of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> > Domain Host. Those host-side TDX interface functions are categorized into
> > various areas just for better organization, such as SYS (TDX module management),
> > MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> > etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
> > 
> > TDCS (Trust Domain Control Structure) is the main control structure of a guest
> > TD, and encrypted (using the guest TD's ephemeral private key).  At a high
> > level, TDCS holds information for controlling TD operation as a whole,
> > execution, EPTP, MSR bitmaps, etc that KVM needs to set it up.  Note that MSR
> > bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> > same value for all VCPUs of the same TD.
> > 
> > Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> > TD VCPU.  It helps the TDX module control the operation of the VCPU, and holds
> > the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> > DMA access, accessible only by using the TDX module interface functions (such as
> > TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> > such as virtual APIC page, virtualization exception information, etc.
> > 
> > Several VMX control structures (such as Shared EPT and Posted interrupt
> > descriptor) are directly managed and accessed by the host VMM.  These control
> > structures are pointed to by fields in the TD VMCS.
> > 
> > The above means that 1) KVM needs to allocate different data structures for TDs,
> > 2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
> > define TD-specific handling for others.  3) Redirect operations to .  3)
> > Redirect operations to the TDX specific callbacks, like "if (is_td_vcpu(vcpu))
> > tdx_callback() else vmx_callback();".
> > 
> > *TD Private Memory
> > TD private memory is designed to hold TD private content, encrypted by the CPU
> > using the TD ephemeral key. An encryption engine holds a table of encryption
> > keys, and an encryption key is selected for each memory transaction based on a
> > Host Key Identifier (HKID). By design, the host VMM does not have access to the
> > encryption keys.
> > 
> > In the first generation of MKTME, HKID is "stolen" from the physical address by
> > allocating a configurable number of bits from the top of the physical
> > address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> > accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> > HKID on the host so that MKTME can be opaque or bypassed on the host.
> > 
> > During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> > as either shared or private, based on the value of a new SHARED bit in the Guest
> > Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
> > (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> > VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> > with the current VMX. Since guest TDs usually require I/O and the data exchange
> > needs to be done via shared memory, KVM needs to use the current EPT
> > functionality even for TDs.
> > 
> > * Secure EPT and Mirroring using the TDP code
> > The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
> > pages are encrypted and integrity-protected with the TD's ephemeral private
> > key.  Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> > interface functions, and thus conceptually Secure EPT is a subset of EPT.
> > Since execution of such interface functions takes a much longer time than
> > accessing memory directly, in KVM we use the existing TDP code to mirror the
> > Secure EPT for the TD.
> > 
> > This way, we can effectively walk Secure EPT without using the TDX interface
> > functions.
> > 
> > * VM life cycle and TDX specific operations
> > The userspace VMM, such as QEMU, needs to build and treat TDs differently.  For
> > example, a TD needs to boot in private memory, and the host software cannot copy
> > the initial image to private memory.
> > 
> > * TSC Virtualization
> > The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> > (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> > by TD configuration, i.e. when the TD is created, not per VCPU.  The current KVM
> > owns TSC virtualization for VMs, but the TDX module does for TDs.
> > 
> > * MCE support for TDs
> > The TDX module doesn't allow the VMM to inject MCEs.  Instead, a PV way is
> > needed for the TD to communicate with the VMM.  For now, KVM silently ignores
> > MCE requests by the VMM.  MSRs related to MCE (e.g., MCE bank registers) can be
> > naturally emulated by paravirtualizing MSR access.
> > 
> > For details, the specifications [1], [2], [3], [4], [5], [6], and [7] are
> > available.
> > 
> > * Restrictions or future work
> > Some features are not included to reduce patch size.  Those features are
> > addressed as future independent patch series.
> > - large page (2M, 1G)
> > - qemu gdb stub
> > - guest PMU
> > - and more
> > 
> > * Prerequisites
> > It's required to load the TDX module and initialize it.  That is out of the
> > scope of this patch series.  Another independent patch series for the common
> > x86 code is planned.  It defines CONFIG_INTEL_TDX_HOST, and this patch series
> > uses CONFIG_INTEL_TDX_HOST.  It's assumed that with CONFIG_INTEL_TDX_HOST=y,
> > the TDX module is initialized and the TDX module APIs for the TDX guest life
> > cycle, like tdh.mng.init, are ready for KVM to use.
> > 
> > Concretely, global initialization, LP (Logical Processor) initialization,
> > global configuration, the key configuration, and TDMR and PAMT initialization
> > are done.  The state of the TDX module is SYS_READY.  Please refer to the TDX
> > module specification, the chapter "Intel TDX Module Lifecycle State Machine".
> > 
> > ** Detecting the TDX module readiness.
> > The TDX host patch series implements the detection of the TDX module
> > availability and its initialization so that KVM can use it.  It also manages
> > the Host KeyID (HKID) assigned to a guest TD.
> > The assumed APIs the TDX host patch series provides are below (consumed
> > roughly as sketched after this list).
> > - int seamrr_enabled()
> >   Check if the required CPU feature (SEAM mode) is available.  This only
> >   checks CPU feature availability.  At this point, the TDX module may not be
> >   ready for KVM to use.
> > - int init_tdx(void);
> >   Initialize the TDX module so that it is ready for KVM to use.
> > - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> >   Return the system-wide information about the TDX module.  NULL if the TDX
> >   module isn't initialized.
> > - u32 tdx_get_global_keyid(void);
> >   Return the global key ID that is used for the TDX module itself.
> > - int tdx_keyid_alloc(void);
> >   Allocate an HKID for a guest TD.
> > - void tdx_keyid_free(int keyid);
> >   Free the HKID of a guest TD.
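> > 
> > A rough sketch of how KVM could consume these assumed APIs when kvm_intel.ko
> > loads (the function name and error codes are illustrative):
> > 
> >   static int __init tdx_module_setup(void)
> >   {
> >           const struct tdsysinfo_struct *sysinfo;
> > 
> >           if (!seamrr_enabled())
> >                   return -EOPNOTSUPP;     /* no SEAM mode on this CPU */
> > 
> >           if (init_tdx())
> >                   return -EIO;            /* TDX module failed to initialize */
> > 
> >           sysinfo = tdx_get_sysinfo();
> >           if (!sysinfo)
> >                   return -EIO;
> > 
> >           pr_info("TDX module ready, global keyid %u\n", tdx_get_global_keyid());
> >           return 0;
> >   }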
> > 
> > (****)
> > * TDX KVM high-level design
> > - Host key ID management
> > A Host Key ID (HKID) needs to be assigned to each TDX guest for memory
> > encryption.  It is assumed that the TDX host patch series implements the
> > necessary functions: u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void)
> > and void tdx_keyid_free(int keyid).
> > 
> > - Data structures and VM type
> > Because TDX is different from VMX, define TDX's own VM/VCPU structures, struct
> > kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct vcpu_vmx.  To
> > identify the VM, introduce a VM type to specify which VM type, VMX (default) or
> > TDX, is used.
> > 
> > - VM life cycle and TDX specific operations
> > Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> > New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> > parameters, set initial guest memory and measurement.
> > 
> > The creation of a TDX VM requires five additional operations on top of the
> > conventional VM creation (see the sketch after this list).
> >   - Get KVM system capability to check if TDX VM type is supported
> >   - VM creation (KVM_CREATE_VM)
> >   - New: Get the TDX specific system parameters.  KVM_TDX_GET_CAPABILITY.
> >   - New: Set TDX specific VM parameters.  KVM_TDX_INIT_VM.
> >   - VCPU creation (KVM_CREATE_VCPU)
> >   - New: Set TDX specific VCPU parameters.  KVM_TDX_INIT_VCPU.
> >   - New: Initialize guest memory as boot state and extend the measurement with
> >     the memory.  KVM_TDX_INIT_MEM_REGION.
> >   - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
> >     TDX VM contents.
> >   - VCPU RUN (KVM_RUN)
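> > 
> > A hedged sketch of that sequence from userspace.  The payload structs (struct
> > kvm_tdx_cmd etc.) are defined by this series' uapi headers and passed in
> > opaquely here; which fd takes each command follows the list above, and error
> > handling is elided:
> > 
> >   static int td_build(int kvm_fd, unsigned long tdx_vm_type, void *caps,
> >                       void *init_vm, void *init_vcpu, void *init_mem,
> >                       void *finalize)
> >   {
> >           int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, tdx_vm_type);
> >           int vcpu_fd;
> > 
> >           ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, caps);        /* KVM_TDX_GET_CAPABILITY */
> >           ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, init_vm);     /* KVM_TDX_INIT_VM */
> >           vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
> >           ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, init_vcpu); /* KVM_TDX_INIT_VCPU */
> >           ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, init_mem);    /* KVM_TDX_INIT_MEM_REGION */
> >           ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, finalize);    /* KVM_TDX_FINALIZE */
> >           return vcpu_fd;                                   /* ready for KVM_RUN */
> >   }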
> > 
> > - Protected guest state
> > Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> > can't operate on it: for example, accessing CPU registers, injecting
> > exceptions, or accessing guest memory.  Those operations are silently ignored,
> > returning zero or the initial reset value when requested via KVM API ioctls.
> > 
> >     VM/VCPU state and callbacks for TDX specific operations:
> >     Define TDX specific VM state and VCPU state instead of the VMX ones.
> >     Redirect operations to TDX specific callbacks.  "if (tdx) tdx_op() else
> >     vmx_op()".
> > 
> >     Operations on the CPU state:
> >     Silently ignore operations on the guest state.  For example, a write to
> >     CPU registers is ignored and a read from CPU registers returns 0.
> > 
> >     . ignore access to CPU registers except for allowed ones.
> >     . TSC: add a check if the TSC is immutable and return an error, because the
> >       KVM implementation updates the internal TSC state and it's difficult to
> >       back out those changes.  Instead, skip the logic.
> >     . dirty logging: add a check if dirty logging is supported.
> >     . exceptions/SMI/MCE/SIPI/INIT: silently ignore
> > 
> >     Note: virtual external interrupt and NMI can be injected into TDX guests.
> > 
> > - KVM MMU integration
> > One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> > the guest physical address is private (the bit is cleared) or shared (the bit is
> > set).  The bits are called stolen bits.
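> > 
> > A small sketch of the repurposed-bit check (the per-VM mask field is an
> > assumption; the helper names follow this series' changelog):
> > 
> >   /* Mask of the stolen GPA bit in GFN terms; 0 for non-TD VMs. */
> >   static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
> >   {
> >           return kvm->arch.gfn_shared_mask;  /* e.g. bit 51 or 47 of the GPA */
> >   }
> > 
> >   static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
> >   {
> >           gfn_t mask = kvm_gfn_shared_mask(kvm);
> > 
> >           /* Private iff a shared bit exists and is cleared in this GPA. */
> >           return mask && !(gpa_to_gfn(gpa) & mask);
> >   }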
> > 
> >   - Stolen bits framework
> >     systematically tracks which guest physical address, shared or private, is
> >     used.
> > 
> >   - Shared EPT and Secure EPT
> >     There are two EPTs: Shared EPT (the conventional one) and Secure
> >     EPT (the new one). Shared EPT is handled the same as before, for
> >     GPAs with the stolen bit set.  Secure EPT points to private guest
> >     pages.  To resolve an EPT violation, KVM walks one of the two EPTs
> >     based on the faulted GPA.  Because it's costly to access Secure EPT
> >     with SEAMCALLs while walking EPTs for a private guest physical
> >     address, another private EPT is used as a mirror of Secure-EPT with
> >     the existing logic at the cost of extra memory.
> > 
> > The following depicts the relationship.
> > 
> >                     KVM                             |       TDX module
> >                      |                              |           |
> >         -------------+----------                    |           |
> >         |                      |                    |           |
> >         V                      V                    |           |
> >      shared GPA           private GPA               |           |
> >   CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
> >         |                      |                    |           |
> >         |                      |                    |           |
> >         V                      V                    |           V
> >   shared EPT                private EPT--------mirror----->Secure EPT
> >         |                      |                    |           |
> >         |                      \--------------------+------\    |
> >         |                                           |      |    |
> >         V                                           |      V    V
> >   shared guest page                                 |    private guest page
> >                                                     |
> >                                                     |
> >                               non-encrypted memory  |    encrypted memory
> >                                                     |
> > 
> >   - Operating on Secure EPT
> >     Use the TDX module APIs to operate on Secure EPT.  To call the TDX APIs
> >     while resolving an EPT violation, add hooks for the additional operations
> >     and wire them to the TDX backend.
> > 
> > * References
> > 
> > [1] TDX specification
> >    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > [2] Intel Trust Domain Extensions (Intel TDX)
> >    https://cdrdv2.intel.com/v1/dl/getContent/726790
> > [3] Intel CPU Architectural Extensions Specification
> >    https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> > [4] Intel TDX Module 1.0 Specification
> >    https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> > [5] Intel TDX Loader Interface Specification
> >   https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > [6] Intel TDX Guest-Hypervisor Communication Interface
> >    https://cdrdv2.intel.com/v1/dl/getContent/726790
> > [7] Intel TDX Virtual Firmware Design Guide
> >    https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> > [8] intel public github
> >    kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> >    TDX guest branch: https://github.com/intel/tdx/tree/guest
> >    qemu TDX https://github.com/intel/qemu-tdx
> > [9] TDVF
> >     https://github.com/tianocore/edk2-staging/tree/TDVF
> >     This was merged into EDK2 main branch. https://github.com/tianocore/edk2
> > 
> > Chao Gao (3):
> >   KVM: x86: Move check_processor_compatibility from init ops to runtime
> >     ops
> >   Partially revert "KVM: Pass kvm_init()'s opaque param to additional
> >     arch funcs"
> >   KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> >     wrmsr
> > 
> > Isaku Yamahata (72):
> >   KVM: Refactor CPU compatibility check on module initialiization
> >   x86/virt/vmx/tdx: export platform_tdx_enabled()
> >   KVM: TDX: Detect CPU feature on kernel module initialization
> >   KVM: x86: Refactor KVM VMX module init/exit functions
> >   KVM: TDX: Add placeholders for TDX VM/vcpu structure
> >   x86/virt/tdx: Add a helper function to return system wide info about
> >     TDX module
> >   KVM: TDX: Initialize TDX module when loading kvm_intel.ko
> >   KVM: TDX: Make TDX VM type supported
> >   [MARKER] The start of TDX KVM patch series: TDX architectural
> >     definitions
> >   KVM: TDX: Define TDX architectural definitions
> >   KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> >   KVM: TDX: Add helper functions to print TDX SEAMCALL error
> >   [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> >   x86/cpu: Add helper functions to allocate/free TDX private host key id
> >   KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> >   KVM: TDX: Make pmu_intel.c ignore guest TD case
> >   [MARKER] The start of TDX KVM patch series: TD vcpu
> >     creation/destruction
> >   KVM: TDX: allocate/free TDX vcpu structure
> >   KVM: TDX: allocate/free TDX vcpu structure
> >   [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
> >   KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> >   [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> >     TDX
> >   KVM: x86/mmu: Disallow fast page fault on private GPA
> >   KVM: VMX: Introduce test mode related to EPT violation VE
> >   [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> >   KVM: x86/mmu: Focibly use TDP MMU for TDX
> >   KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
> >   KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> >   KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> >   [MARKER] The start of TDX KVM patch series: TDX EPT violation
> >   KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> >   KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
> >   KVM: TDX: TDP MMU TDX support
> >   [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> >   KVM: x86/mmu: steal software usable git to record if GFN is for shared
> >     or not
> >   KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> >   [MARKER] The start of TDX KVM patch series: TD finalization
> >   KVM: TDX: Create initial guest memory
> >   KVM: TDX: Finalize VM initialization
> >   [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> >   KVM: TDX: Add helper assembly function to TDX vcpu
> >   KVM: TDX: Implement TDX vcpu enter/exit path
> >   KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> >   KVM: TDX: restore host xsave state when exit from the guest TD
> >   KVM: TDX: restore user ret MSRs
> >   [MARKER] The start of TDX KVM patch series: TD vcpu
> >     exits/interrupts/hypercalls
> >   KVM: TDX: complete interrupts after tdexit
> >   KVM: TDX: restore debug store when TD exit
> >   KVM: TDX: handle vcpu migration over logical processor
> >   KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> >     behavior
> >   KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
> >   KVM: TDX: Implement interrupt injection
> >   KVM: TDX: Implements vcpu request_immediate_exit
> >   KVM: TDX: Implement methods to inject NMI
> >   KVM: TDX: Add a place holder to handle TDX VM exit
> >   KVM: TDX: handle EXIT_REASON_OTHER_SMI
> >   KVM: TDX: handle ept violation/misconfig exit
> >   KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> >   KVM: TDX: Add a place holder for handler of TDX hypercalls
> >     (TDG.VP.VMCALL)
> >   KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> >   KVM: TDX: Handle TDX PV CPUID hypercall
> >   KVM: TDX: Handle TDX PV HLT hypercall
> >   KVM: TDX: Handle TDX PV port io hypercall
> >   KVM: TDX: Implement callbacks for MSR operations for TDX
> >   KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
> >   KVM: TDX: Handle TDX PV report fatal error hypercall
> >   KVM: TDX: Handle TDX PV map_gpa hypercall
> >   KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
> >   KVM: TDX: Silently discard SMI request
> >   KVM: TDX: Silently ignore INIT/SIPI
> >   Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> >   KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> > 
> > Rick Edgecombe (1):
> >   KVM: x86/mmu: Add address conversion functions for TDX shared bits
> > 
> > Sean Christopherson (25):
> >   KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> >   KVM: Enable hardware before doing arch VM initialization
> >   KVM: x86: Introduce vm_type to differentiate default VMs from
> >     confidential VMs
> >   KVM: TDX: Add TDX "architectural" error codes
> >   KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> >   KVM: TDX: create/destroy VM structure
> >   KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
> >   KVM: TDX: Do TDX specific vcpu initialization
> >   KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> >   KVM: x86/mmu: Allow non-zero value for non-present SPTE
> >   KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> >   KVM: x86/mmu: Allow per-VM override of the TDP max page level
> >   KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
> >     private mmu
> >   KVM: x86/mmu: Disallow dirty logging for x86 TDX
> >   KVM: VMX: Split out guts of EPT violation to common/exposed function
> >   KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> >   KVM: TDX: Add load_mmu_pgd method for TDX
> >   KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> >   KVM: TDX: Add support for find pending IRQ in a protected local APIC
> >   KVM: x86: Assume timer IRQ was injected if APIC state is proteced
> >   KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> >     argument
> >   KVM: VMX: Move NMI/exception handler to common helper
> >   KVM: x86: Split core of hypercall emulation to helper function
> >   KVM: TDX: Handle TDX PV MMIO hypercall
> >   KVM: TDX: Add methods to ignore accesses to CPU state
> > 
> > Xiaoyao Li (1):
> >   KVM: TDX: initialize VM with TDX specific parameters
> > 
> >  Documentation/virt/kvm/api.rst                |   30 +-
> >  .../virt/kvm/intel-tdx-layer-status.rst       |   33 +
> >  Documentation/virt/kvm/intel-tdx.rst          |  381 +++
> >  Documentation/virt/kvm/tdx-tdp-mmu.rst        |  466 ++++
> >  arch/arm64/kvm/arm.c                          |    2 +-
> >  arch/mips/kvm/mips.c                          |   14 +-
> >  arch/powerpc/kvm/powerpc.c                    |    2 +-
> >  arch/riscv/kvm/main.c                         |    2 +-
> >  arch/s390/kvm/kvm-s390.c                      |    2 +-
> >  arch/x86/events/intel/ds.c                    |    1 +
> >  arch/x86/include/asm/kvm-x86-ops.h            |   10 +
> >  arch/x86/include/asm/kvm_host.h               |   56 +-
> >  arch/x86/include/asm/tdx.h                    |   67 +
> >  arch/x86/include/asm/vmx.h                    |   14 +
> >  arch/x86/include/uapi/asm/kvm.h               |   95 +
> >  arch/x86/include/uapi/asm/vmx.h               |    5 +-
> >  arch/x86/kvm/Kconfig                          |    4 +
> >  arch/x86/kvm/Makefile                         |    3 +-
> >  arch/x86/kvm/irq.c                            |    3 +
> >  arch/x86/kvm/lapic.c                          |   37 +-
> >  arch/x86/kvm/lapic.h                          |    2 +
> >  arch/x86/kvm/mmu.h                            |   42 +-
> >  arch/x86/kvm/mmu/mmu.c                        |  360 ++-
> >  arch/x86/kvm/mmu/mmu_internal.h               |  123 +-
> >  arch/x86/kvm/mmu/paging_tmpl.h                |    5 +-
> >  arch/x86/kvm/mmu/spte.c                       |   46 +-
> >  arch/x86/kvm/mmu/spte.h                       |   65 +-
> >  arch/x86/kvm/mmu/tdp_iter.c                   |    1 +
> >  arch/x86/kvm/mmu/tdp_iter.h                   |    5 +-
> >  arch/x86/kvm/mmu/tdp_mmu.c                    |  690 ++++-
> >  arch/x86/kvm/mmu/tdp_mmu.h                    |   12 +-
> >  arch/x86/kvm/svm/svm.c                        |   13 +-
> >  arch/x86/kvm/vmx/common.h                     |  174 ++
> >  arch/x86/kvm/vmx/evmcs.c                      |    2 +-
> >  arch/x86/kvm/vmx/evmcs.h                      |    2 +-
> >  arch/x86/kvm/vmx/main.c                       | 1071 +++++++
> >  arch/x86/kvm/vmx/pmu_intel.c                  |   39 +-
> >  arch/x86/kvm/vmx/pmu_intel.h                  |   28 +
> >  arch/x86/kvm/vmx/posted_intr.c                |   43 +-
> >  arch/x86/kvm/vmx/posted_intr.h                |   13 +
> >  arch/x86/kvm/vmx/tdx.c                        | 2465 +++++++++++++++++
> >  arch/x86/kvm/vmx/tdx.h                        |  275 ++
> >  arch/x86/kvm/vmx/tdx_arch.h                   |  157 ++
> >  arch/x86/kvm/vmx/tdx_errno.h                  |   29 +
> >  arch/x86/kvm/vmx/tdx_error.c                  |   22 +
> >  arch/x86/kvm/vmx/tdx_ops.h                    |  188 ++
> >  arch/x86/kvm/vmx/vmenter.S                    |  146 +
> >  arch/x86/kvm/vmx/vmx.c                        |  737 ++---
> >  arch/x86/kvm/vmx/vmx.h                        |   39 +-
> >  arch/x86/kvm/vmx/x86_ops.h                    |  235 ++
> >  arch/x86/kvm/x86.c                            |  148 +-
> >  arch/x86/virt/vmx/tdx/seamcall.S              |    2 +
> >  arch/x86/virt/vmx/tdx/tdx.c                   |   54 +-
> >  arch/x86/virt/vmx/tdx/tdx.h                   |   52 -
> >  include/linux/kvm_host.h                      |    4 +-
> >  include/uapi/linux/kvm.h                      |    2 +
> >  tools/arch/x86/include/uapi/asm/kvm.h         |   95 +
> >  tools/include/uapi/linux/kvm.h                |    1 +
> >  virt/kvm/kvm_main.c                           |   67 +-
> >  59 files changed, 7877 insertions(+), 804 deletions(-)
> >  create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
> >  create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> >  create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> >  create mode 100644 arch/x86/kvm/vmx/common.h
> >  create mode 100644 arch/x86/kvm/vmx/main.c
> >  create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
> >  create mode 100644 arch/x86/kvm/vmx/tdx.c
> >  create mode 100644 arch/x86/kvm/vmx/tdx.h
> >  create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> >  create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> >  create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> >  create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> >  create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> > 
> > -- 
> > 2.25.1
> > 
> 
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>
Chao Peng July 12, 2022, 10:54 a.m. UTC | #4
On Tue, Jul 12, 2022 at 01:07:20PM +0800, Chao Gao wrote:
> On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> >Hi. Because my description of large page support was terse, I wrote up a more
> >detailed one.  Any feedback/thoughts on large page support?
> >
> >TDP MMU large page support design
> >
> >Two main discussion points
> >* how to track page status. private vs shared, no-largepage vs can-be-largepage
> 
> ...
> 
> >
> >Tracking private/shared and large page mappable
> >-----------------------------------------------
> >VMM needs to track that page is mapped as private or shared at 4KB granularity.
> >For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
> >track the page can be mapped as a large page (regarding private/shared).  VMM
> >updates it on MapGPA and references it on the EPT violation path. (****)
> 
> Isaku,
> 
> + Peng Chao
> 
> Doesn't UPM guarantee that 2MB/1GB large page in CR3 should be either all
> private or all shared?
> 
> KVM always retrieves the mapping level in CR3 and enforces that EPT's
> page level is not greater than that in CR3. My point is if UPM already enforces
> no mixed pages in a large page, then KVM needn't do that again (UPM can
> be trusted).

The backing store in UPM can tell KVM which page level it can
support for a given private gpa, similar to host_pfn_mapping_level() for
a shared address.

However, this solely represents the backing store's capability; KVM
still needs additional info to decide whether the range can be safely
mapped as 2M/1G, e.g. all the pages in the 2M/1G range should be
private, and currently this is not something the backing store can tell.

Actually, in UPM v7 we let KVM record this info so one possible solution
is making use of it.

  https://lkml.org/lkml/2022/7/6/259

Then to map a page as 2M, KVM needs to check (sketched below):
  - Memory backing store support that level
  - All pages in 2M range are private as we recorded through
    KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
  - No existing partial 4K map(s) in 2M range
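
A sketch of those three checks combined (the helper names are hypothetical):

  static bool can_map_2m_private(struct kvm *kvm, gfn_t gfn)
  {
          unsigned long nr = KVM_PAGES_PER_HPAGE(PG_LEVEL_2M);
          gfn_t base = gfn & ~(nr - 1);

          return private_backing_store_level(kvm, base) >= PG_LEVEL_2M &&
                 range_all_private(kvm, base, nr) &&  /* recorded via the ioctls */
                 !range_has_4k_map(kvm, base, nr);
  }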

Chao
> 
> Maybe I am misunderstanding something?
Isaku Yamahata July 12, 2022, 5:22 p.m. UTC | #5
On Tue, Jul 12, 2022 at 06:54:19PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> On Tue, Jul 12, 2022 at 01:07:20PM +0800, Chao Gao wrote:
> > On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> > >Hi. Because my description of large page support was terse, I wrote up a more
> > >detailed one.  Any feedback/thoughts on large page support?
> > >
> > >TDP MMU large page support design
> > >
> > >Two main discussion points
> > >* how to track page status. private vs shared, no-largepage vs can-be-largepage
> > 
> > ...
> > 
> > >
> > >Tracking private/shared and large page mappable
> > >-----------------------------------------------
> > >VMM needs to track that page is mapped as private or shared at 4KB granularity.
> > >For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
> > >track the page can be mapped as a large page (regarding private/shared).  VMM
> > >updates it on MapGPA and references it on the EPT violation path. (****)
> > 
> > Isaku,
> > 
> > + Peng Chao
> > 
> > Doesn't UPM guarantee that 2MB/1GB large page in CR3 should be either all
> > private or all shared?
> > 
> > KVM always retrieves the mapping level in CR3 and enforces that EPT's
> > page level is not greater than that in CR3. My point is if UPM already enforces
> > no mixed pages in a large page, then KVM needn't do that again (UPM can
> > be trusted).
> 
> The backing store in UPM can tell KVM which page level it can
> support for a given private gpa, similar to host_pfn_mapping_level() for
> a shared address.
>
> However, this solely represents the backing store's capability; KVM
> still needs additional info to decide whether the range can be safely
> mapped as 2M/1G, e.g. all the pages in the 2M/1G range should be
> private, and currently this is not something the backing store can tell.

This argument applies to shared GPAs too.  The shared pages are backed by a
normal file mapping with UPM.  When KVM is mapping a shared GPA, the same check
is needed.  So I think KVM has to track all-private, all-shared, or no-largepage
at the 2MB/1GB level.  If UPM tracks shared-or-private at the 4KB level, KVM
probably doesn't need to track it at the 4KB level.


> Actually, in UPM v7 we let KVM record this info so one possible solution
> is making use of it.
> 
>   https://lkml.org/lkml/2022/7/6/259
> 
> Then to map a page as 2M, KVM needs to check:
>   - Memory backing store support that level
>   - All pages in 2M range are private as we recorded through
>     KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
>   - No existing partial 4K map(s) in 2M range
Isaku Yamahata July 12, 2022, 5:35 p.m. UTC | #6
On Tue, Jul 12, 2022 at 06:49:25PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> > Hi. Because my description of large page support was terse, I wrote up a more
> > detailed one.  Any feedback/thoughts on large page support?
> > 
> > TDP MMU large page support design
> > 
> > Two main discussion points
> > * how to track page status. private vs shared, no-largepage vs can-be-largepage
> > * how to trigger merging mapping from 4KB/2MB to 2MB/1GB
> > 
> > Expected private-vs-shared page usage
> > -------------------------------------
> > On TD boot all pages are private and TD converts pages into shared if necessary.
> > * Most of the guest pages remain private.
> > * Only limited pages are converted at kernel boot
> >   ** bounce buffer for IO (virt-io).  It's allocated as swiotlb.  Its size is
> >      64MB or 6% of total guest memory.
> >   ** KVM PV shared page. (the current guest TD doesn't use KVM PV shared page.)
> > * Only a small number of pages are dynamically converted from private to shared
> >   and vice versa.  This usage is very limited. e.g. GetQuote, the lack of
> >   swiotlb buffer
> > 
> > 
> > Theory of Secure-EPT operations related to large page
> > -----------------------------------------------------
> > TDX Secure-EPT has differences from VMX EPT.
> > To add a page to Secure-EPT
> > 
> > * Here is the operation to resolve the EPT violation.
> > 1. TD: Accepts GPA.  TD needs to accept GPA before accessing GPA because TD
> >    needs to detect that VMM unmaps GPA and maps GPA again.
> > 2. EPT violation is triggered.  TD exit to VMM.
> > 3. VMM: allocate a page for GPA and TDH.MEM.PAGE.AUG it to GPA.  Resume TD vcpu.
> >    (3a. TD: #VE<EPT violation> is injected.  #VE handler accepts the page)
> > 4. TD: resume #VE and continue TD vcpu execution
> > 
> > The TD may omit step 1.  In that case, after step 3, #VE is injected into the
> > TD and the TD's #VE handler needs to accept the page.
> > 
> > When adding a page to Secure-EPT again, the page contents are cleared and the
> > page is encrypted.  If a page is disassociated from Secure-EPT and added again,
> > the page contents are lost.
> > 
> > * TDG.VP.VMCALL<MapGPA> hypercall
> > The page associated with GPA can be private or shared.  TD converts the GPA by
> > TDG.VP.VMCALL<MapGPA> hypercall from private to shared or vice versa.  VMM
> > tracks whether the given GPA is private or shared.
> > 
> > * mapping merge(promote)/split(demote)
> > The page can be mapped as a large page (2MB or 1GB) in addition to 4KB.  The
> > mapping can be merged (4KB/2MB -> 2MB/1GB) or split (2MB/1GB -> 4KB/2MB) by the
> > TDX SEAMCALLs TDH.MEM.PAGE.PROMOTE and TDH.MEM.PAGE.DEMOTE.
> > Merging mappings requires all the pages to be mapped, unlike VMX EPT, because
> > of encryption.  This implies the current KVM implementation doesn't work for
> > TDX when merging mappings, as follows.
> > 
> > - EPT violation and host page is 2MB mappable.
> >   some of the 4KB pages of the given 2MB page are already mapped, some not.
> >   i.e. 2MB EPT -> 4KB EPT -> 4K pages
> > - KVM page fault handler zap 2MB EPT entry and populate 2MB EPT entry
> >   zap: 2MB EPT: non present
> >   populate 2MB: -> 2MB page
> > 
> > If the VMM zaps a 2MB Secure-EPT entry, the page contents will be lost for TDX.
> > Mapping merge requires that all pages are already mapped.
> > 
> > Instead, the following steps are needed.
> > - EPT violation and host page is 2MB mappable.
> >   some of the 4KB pages of the given 2MB page are already mapped.  Some not.
> >   i.e. 2MB EPT -> 4KB EPT -> 4K pages
> > - VMM checks all 4KB GPAs are private. If not, it can't be mapped as a large page.
> >   (****)
> > - VMM checks all 4KB GPAs are already mapped.  If not, give up mapping merge.
> >   (or map missing 4KB pages.)
> > - mapping merge by TDH.MEM.PAGE.PROMOTE
> > 
> > The mapping split for TDX Secure-EPT works similarly to the VMX EPT case.
> > 
> > 
> > EPT violation and MapGPA
> > ------------------------
> > - EPT violation is a fast path
> > - MapGPA is not a fast path.
> > => Keep the EPT violation path optimized and complicate the MapGPA path.  For
> > the (****) check, we don't want to scan the 4KB mappings on EPT violation.
> > Instead, the MapGPA path scans them and records whether the page can be mapped
> > as 2MB with respect to private/shared.
> 
> This sounds reasonable.  Instead of tracking that in MapGPA, maybe
> KVM_MEMORY_ENCRYPT_{UN,}REG_REGION introduced in UPM v7 is a better
> place to put the scan code in.
> 
>   https://lkml.org/lkml/2022/7/6/259
> 
> Both the MapGPA (explicit conversion) and the EPT violation (implicit
> conversion) can cause invocation of these two ioctls and need to update
> this info.
> 
> > 
> > 
> > Tracking private/shared and large page mappable
> > -----------------------------------------------
> > VMM needs to track that page is mapped as private or shared at 4KB granularity.
> > For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
> > track the page can be mapped as a large page (regarding private/shared).  VMM
> > updates it on MapGPA and references it on the EPT violation path. (****)
> > 
> > For 4KB pages, 1 bit is needed: private or shared.  Let's call it the
> > shared-mask bit.  For 2MB/1GB pages, 2 bits are needed: large page mappable or
> > not, and private or shared if mappable.  Let's call it the no-largepage bit.
> 
> I'm just thinking maybe we don't need to introduce new bits; instead we
> can reuse lpage_info, which we already use to track whether a page can be
> mapped at a specified page level in kvm_mmu_max_mapping_level(). Then in
> the above two ioctls we do a scan for each level and update lpage_info.
> For example, we should set disallow_lpage if private/shared pages are
> mixed at that page level.
> 
> It's however a bit tricky to manage lpage_info.disallow_lpage in these
> two ioctls with the current code. We can't simply do disallow_lpage++ and
> disallow_lpage--. One possible solution is to treat disallow_lpage as a
> mask instead of a count. Then we define bits like below for use:
>   - USER_GFN_UNALIGNED set when the memslot user_address/private_offset/gfn
>     is not aligned on the page level
>   - PAGE_TRACKING set during page tracking
>   - PRIVATE_SHARED_MIXED set when private/shared pages are mixed
> 
> In the page fault handler the page can be mapped at that level only when
> all bits are zero, and in the above two ioctls we just switch on/off the
> bit PRIVATE_SHARED_MIXED.

So steal 1 or 2 bits from kvm_lpage_info.disallow_lpage instead of adding one
more array in struct kvm_arch_memory_slot.  Nice idea.  Let's call it option
A.1), sketched below.  With option A), we increment/decrement disallow_lpage;
with option A.1), it is handled automatically.

pros:
+SPTE_SHARED_MASK is not needed
cons:
-one more look-up on EPT violation
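
A minimal sketch of option A.1), assuming disallow_lpage is re-defined as a bit
mask (the bit names come from the suggestion above; today's code treats the
field as a count):

  #define KVM_LPAGE_USER_GFN_UNALIGNED    BIT(0)
  #define KVM_LPAGE_PAGE_TRACKING         BIT(1)
  #define KVM_LPAGE_PRIVATE_SHARED_MIXED  BIT(2)

  static void lpage_set_mixed(struct kvm_lpage_info *linfo, bool mixed)
  {
          if (mixed)
                  linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
          else
                  linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
  }

  static bool lpage_allowed(const struct kvm_lpage_info *linfo)
  {
          /* Map at this level only when every disallow bit is clear. */
          return !linfo->disallow_lpage;
  }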


> Currently UPM doesn't have this code yet, but it can be added if feasible.

Anyway let me integrate UPM v7.

Thanks,


> Chao
> > 
> > Option A.)
> >   Allocate array for pages in struct kvm_arch_memory_slot on TD creation.
> >   struct kvm_arch_memory_slot {
> >     +struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> >   }
> > 
> >   pros:
> >   +straight forward implementation
> >   +SPTE_SHARED_MASK is not needed
> >   cons:
> >   -memory overhead is high
> >   -not optimized for expected usage
> >   -one more look-up on EPT violation
> > 
> > Option B.) Steal two software usable bits from SPTE and record them in SPTE.
> >            SPTE_SHARED_MASK, SPTE_NOLARGE_PAGE_MASK
> >   pros:
> >   +optimized for EPT violation
> >   cons:
> >   -2bits used in SPTE entry
> >   -complicates the MapGPA path.
> > 
> > Option C.) Steal one software usable bit from SPTE and record it in SPTE.
> >            SPTE_SHARED_MASK
> >            For 2MB/1GB, allocate bitmap in kvm_mmu_page.
> >            struct kvm_mmu_page {
> >              bitmap nolarge
> >            }
> >   pros:
> >   +optimized for EPT violation
> >   cons:
> >   -complicates the MapGPA path.
> >   -information is scattered in SPTE and struct kvm_mmu_page
> > 
> > 
> > How to update those bits
> > ------------------------
> > - MapGPA
> >   - at 4KB level, set or clear shared-mask bit.
> >   - Scan the 512 4KB bits at the 2MB level
> >     - set or clear shared-mask bit, clear no-largepage bit or
> >     - clear shared-mask bit, set no-largepage bit
> >     - increment/decrement lpageinfo to prevent/allow large page
> >   - similar for 1GB level
> >   Note: This logic might be a bit tricky.
> > 
> > - EPT violation
> >   - If 2MB large page is allowed, check if no-largepage bit
> >     - If no-largepage bit is set, => go down to 4KB page
> >     - If no-largepage bit is cleared => try to map 2MB page
> >       - If 4KB level is not mapped, map 2MB page
> >       - If some 4KB level is already mapped, go down to 4KB.
> >         Don't try to merge mapping. Or it's possible to try to merge mapping.
> >   Note: 512-entry 4KB scanning is not done at EPT violation because it's a
> >         fast path.  (A sketch of this check follows.)
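> > 
> > A sketch of that EPT-violation check (the helpers that read the tracking
> > bits are hypothetical):
> > 
> >   static int td_private_max_mapping_level(struct kvm *kvm, gfn_t gfn)
> >   {
> >           /* Mixed private/shared within the 2MB range: map at 4KB. */
> >           if (test_no_largepage_bit(kvm, gfn, PG_LEVEL_2M))
> >                   return PG_LEVEL_4K;
> > 
> >           /* Some 4KB entry is already mapped: go down, don't merge here. */
> >           if (range_partially_mapped(kvm, gfn, PG_LEVEL_2M))
> >                   return PG_LEVEL_4K;
> > 
> >           return PG_LEVEL_2M;
> >   }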
> > 
> > 
> > Map merging
> > -----------
> > Map merging is necessary for TD migration. (Map split is the easy part.)  The
> > current KVM implementation zaps the range (mmu notifier or lpage recovery
> > worker) and expects a large page mapping on the next EPT violation.
> > 
> > Option A.) Keep the code similar to the map merging logic.
> > Zap the 2MB EPT entry in some sense and trigger the map merging logic on the
> > next EPT violation.  To keep encrypted page contents, zapped EPT entries need
> > to keep the page.  Steal one more bit from the SPTE: SPTE_PRIVATE_BLOCKED_MASK.
> > It means that the page is zapped from the SPTE, but it is still alive and the
> > entry still references the page.
> > 
> > Option B.) In the callback, directly merge mapping somehow.  In this case, mmu
> > notifier usage doesn't make sense.
> > 
> > NOTE:
> > - Implement map merging in MapGPA. This doesn't work for dirty page logging.
> > - We can utilize kvm_nx_lpage_recovery_worker
> > - We can utilize THP. Probably doesn't work well for fd-based private memory.
> > 
> > Thanks,
> > Isaku Yamahata
> > 
> > On Mon, Jun 27, 2022 at 02:52:52PM -0700,
> > isaku.yamahata@intel.com wrote:
> > 
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > KVM TDX basic feature support
> > > 
> > > Hello.  This is v7 the patch series vof KVM TDX support.
> > > This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> > > The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> > > How to run/test: It's describe at https://github.com/intel/tdx/wiki/TDX-KVM
> > > 
> > > Major changes from v6:
> > > - rebased to v5.19 base
> > > 
> > > TODO:
> > > - integrate fd-based guest memory. As the discussion is still on-going, I
> > >   intentionally dropped fd-based guest memory support yet.  The integration can
> > >   be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> > > - 2M large page support. It's work-in-progress.
> > > For large page support, there are several design choices. Here is the design options.
> > > Any thoughts/feedback?
> > > 
> > > KVM MMU Large page support for TDX
> > > 
> > > * What needs to be done
> > > - Track private or shared of each page size (4KB, 2MB, 1GB) based on
> > >   TDG.VP.VMCALL<MapGPA>.  For large pages(2MB, 1GB), it can be mixed (some
> > >   lower-size pages are private and some shared.)  In this case, the page can't
> > >   be large.
> > > - if necessary, split large page on TDG.VP.VMCALL<MapGPA>
> > >   (split on dirty page tracking is future work)
> > > - resolving KVM page fault
> > >   When resolving a private page and the page is large in the host, GPA can be
> > >   resolved as a large page in Secure-EPT.  Even if the page is large on the host
> > >   side, sometimes a 4KB page can be resolved because it's up to guest TD to
> > >   accept at 4KB, 2MB, or 1GB.
> > > - collapsing pages into a large page.
> > >   At this point, it's okay to not implement this.  When dirty page tracking is
> > >   supported, this needs to be supported.
> > >   - On MapGPA, the page can be collapsed into a large page
> > >   - handle zapping SPTE and try to collapse the pages on the next KVM page fault
> > >     Unlike the EPT case, some trick is needed.
> > > - For performance, optimize KVM page fault path at the cost of complicating
> > >   MapGPA path.
> > > 
> > > * options to track private or shared
> > > At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> > > 1GB case). For 4KB each page, 1 bit per page is needed. private or shared.  For
> > > large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
> > > mixed).  When resolving KVM page fault, we don't want to check the lower-size
> > > pages to check if the given GPA can be a large for performance.  On MapGPA check
> > > it instead.
> > > 
> > > Option A). enhance kvm_arch_memory_slot
> > >   enum kvm_page_type {
> > >        KVM_PAGE_TYPE_INVALID,
> > >        KVM_PAGE_TYPE_SHARED,
> > >        KVM_PAGE_TYPE_PRIVATE,
> > >        KVM_PAGE_TYPE_MIXED,
> > >   };
> > > 
> > >   struct kvm_page_attr {
> > >        enum kvm_page_type type;
> > >   };
> > > 
> > >  struct kvm_arch_memory_slot {
> > >  +      struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> > > 
> > > Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> > > If !SPTE_MIXED_MASK, it can be large page.
> > > 
> > > Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> > > kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
> > > 
> > > 
> > > * comparison
> > > A).
> > > + straightforward to implement
> > > + SPTE_SHARED_MASK isn't needed
> > > - memory overhead compared to B). or C).
> > > - more memory reference on KVM page fault
> > > 
> > > B).
> > > + simpler than C) (complex than A)?)
> > > + efficient on KVM page fault. (only SPTE reference)
> > > + low memory overhead
> > > - Waste precious SPTE bits.
> > > 
> > > C).
> > > + efficient on KVM page fault. (only SPTE reference)
> > > + low memory overhead
> > > - complicates MapGPA
> > > - scattered data structure
> > > 
> > > Thanks,
> > > Isaku Yamahata
> > > 
> > > Changes from v6:
> > > - rebased to v5.19
> > > 
> > > Changes from v5:
> > > - export __seamcall and use it
> > > - move mutex lock from callee function of smp_call_on_cpu to the caller.
> > > - rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
> > > - updated comment
> > > - drop the use of tdh_mng_key.reclaimid(): as the function is for backward
> > >   compatibility to only return success
> > > - struct kvm_tdx_cmd: metadata => flags, added __u64 error.
> > > - make this ioctl systemwide ioctl
> > > - ABI change to struct kvm_init_vm
> > > - guest_tsc_khz: use kvm->arch.default_tsc_khz
> > > - rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
> > > - drop exporting kvm_set_tsc_khz().
> > > - fix kvm_tdp_page_fault() for mtrr emulation
> > > - rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
> > > - drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
> > >   keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
> > > - update commit message
> > > - rename shadow_init_value => shadow_nonprsent_value
> > > - added ept_violation_ve_test mode
> > > - shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
> > > - legacy MMU case
> > >   => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
> > >      - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> > > - #VE warning:
> > > - rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
> > > - merge into Like we discussed, this patch should be merged with patch
> > >   "KVM: x86/mmu: Allow non-zero init value for shadow PTE".
> > > - fix pointed by Sagi. check !is_private check => (kvm_gfn_shared_mask && !is_private)
> > > - introduce kvm_gfn_for_root(kvm, root, gfn)
> > > - add only_shared argument to kvm_tdp_mmu_handle_gfn()
> > > - use kvm_arch_dirty_log_supported()
> > > - rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
> > > - rename: is_private_prohibit_spte() => spte_shared_mask()
> > > - fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
> > > - dropped this patch as the change was merged into kvm/queue
> > > - update vt_apicv_post_state_restore()
> > > - use is_64_bit_hypercall()
> > > - comment: expand MSMI -> Machine Check System Management Interrupt
> > > - fixed TDX_SEPT_PFERR
> > > - tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
> > > - rename tdmvcall_exit_readon() => tdvmcall_leaf()
> > > - remove optional zero check of argument.
> > > - do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
> > >    in kvm_vcpu_ioctl_smi and __apic_accept_irq.
> > > - WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
> > > - introduce vcpu_deliver_init to x86_ops
> > > - sprinkeled KVM_BUG_ON()
> > > 
> > > Changes from v4:
> > > - rebased to TDX host kernel patch series.
> > > - include all the patches to make this patch series working.
> > > - add [MARKER] patches to mark the patch layer clear.
> > > 
> > > ---
> > > * What's TDX?
> > > TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
> > > Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> > > Domain (TD) for confidential computing.
> > > 
> > > A TD runs in a CPU mode that is designed to protect the confidentiality of its
> > > memory contents and its CPU state from any other software, including the hosting
> > > Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
> > > 
> > > We have more detailed explanations below (***).
> > > We have the high-level design of TDX KVM below (****).
> > > 
> > > In this patch series, we use "TD" or "guest TD" to differentiate it from the
> > > current "VM" (Virtual Machine), which is supported by KVM today.
> > > 
> > > 
> > > * The organization of this patch series
> > > This patch series is on top of the patches series "TDX host kernel support":
> > > https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> > > 
> > > this patch series is available at
> > > https://github.com/intel/tdx/releases/tag/kvm-upstream
> > > The corresponding patches to qemu are available at
> > > https://github.com/intel/qemu-tdx/commits/tdx-upstream
> > > 
> > > The relations of the layers are depicted as follows.
> > > The arrows below show the order of patch reviews we would like to have.
> > > 
> > > The below layers are chosen so that the device model, for example, qemu can
> > > exercise each layering step by step.  Check if TDX is supported, create TD VM,
> > > create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> > > vcpu exits/hypercalls/interrupts to run TD fully.
> > > 
> > >   TDX vcpu
> > >   interrupt/exits/hypercall<------------\
> > >         ^                               |
> > >         |                               |
> > >   TD finalization                       |
> > >         ^                               |
> > >         |                               |
> > >   TDX EPT violation<------------\       |
> > >         ^                       |       |
> > >         |                       |       |
> > >   TD vcpu enter/exit            |       |
> > >         ^                       |       |
> > >         |                       |       |
> > >   TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
> > >         ^                       |                       ^
> > >         |                       |                       |
> > >   TD VM creation/destruction    \---------------KVM TDP MMU hooks
> > >         ^                                               ^
> > >         |                                               |
> > >   TDX architectural definitions                 KVM TDP refactoring for TDX
> > >         ^                                               ^
> > >         |                                               |
> > >    TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
> > >    coexistence          support
> > > 
> > > 
> > > The followings are explanations of each layer.  Each layer has a dummy commit
> > > that starts with [MARKER] in subject.  It is intended to help to identify where
> > > each layer starts.
> > > 
> > > TDX host kernel support:
> > >         https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> > >         The guts of system-wide initialization of TDX module.  There is an
> > >         independent patch series for host x86.  TDX KVM patches call functions
> > >         this patch series provides to initialize the TDX module.
> > > 
> > > TDX, VMX coexistence:
> > >         Infrastructure to allow TDX to coexist with VMX and trigger the
> > >         initialization of the TDX module.
> > >         This layer starts with
> > >         "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> > > TDX architectural definitions:
> > >         Add TDX architectural definitions and helper functions
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> > > TD VM creation/destruction:
> > >         Guest TD creation/destroy allocation and releasing of TDX specific vm
> > >         and vcpu structure.  Create an initial guest memory image with TDX
> > >         measurement.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> > > TD vcpu creation/destruction:
> > >         guest TD creation/destroy Allocation and releasing of TDX specific vm
> > >         and vcpu structure.  Create an initial guest memory image with TDX
> > >         measurement.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> > > TDX EPT violation:
> > >         Create an initial guest memory image with TDX measurement.  Handle
> > >         secure EPT violations to populate guest pages with TDX SEAMCALLs.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> > > TD vcpu enter/exit:
> > >         Allow TDX vcpu to enter into TD and exit from TD.  Save CPU state before
> > >         entering into TD.  Restore CPU state after exiting from TD.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> > > TD vcpu interrupts/exit/hypercall:
> > >         Handle various exits/hypercalls and allow interrupts to be injected so
> > >         that TD vcpu can continue running.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
> > > 
> > > KVM MMU GPA shared bit:
> > >         Introduce framework to handle shared bit repurposed bit of GPA TDX
> > >         repurposed a bit of GPA to indicate shared or private. If it's shared,
> > >         it's the same as the conventional VMX EPT case.  VMM can access shared
> > >         guest pages.  If it's private, it's handled by Secure-EPT and the guest
> > >         page is encrypted.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> > > KVM TDP refactoring for TDX:
> > >         TDX Secure EPT requires different constants. e.g. initial value EPT
> > >         entry value etc. Various refactoring for those differences.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> > > KVM TDP MMU hooks:
> > >         Introduce framework to TDP MMU to add hooks in addition to direct EPT
> > >         access TDX added Secure EPT which is an enhancement to VMX EPT.  Unlike
> > >         conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
> > >         use TDX SEAMCALLs to operate on Secure EPT.
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> > > KVM TDP MMU MapGPA:
> > >         Introduce framework to handle switching guest pages from private/shared
> > >         to shared/private.  For a given GPA, a guest page can be assigned to a
> > >         private GPA or a shared GPA exclusively.  With TDX MapGPA hypercall,
> > >         guest TD converts GPA assignments from private (or shared) to shared (or
> > >         private).
> > >         This layer starts with
> > >         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
> > > 
> > > KVM guest private memory: (not shown in the above diagram)
> > > [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> > > memory: https://lkml.org/lkml/2022/1/18/395
> > >         Guest private memory requires different memory management in KVM.  The
> > >         patch proposes a way for it.  Integration with TDX KVM.
> > > 
> > > (***)
> > > * TDX module
> > > A CPU-attested software module called the "TDX module" is designed to implement
> > > the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> > > loaded by the kernel or driver at runtime, but in this patch series we assume
> > > that the TDX module is already loaded and initialized.
> > > 
> > > The TDX module provides two main new logical modes of operation built upon the
> > > new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> > > architecture. TDX root mode is mostly identical to the VMX root operation mode,
> > > and the TDX functions (described later) are triggered by the new SEAMCALL
> > > instruction with the desired interface function selected by an input operand
> > > (leaf number, in RAX). TDX non-root mode is used for TD guest operation.  TDX
> > > non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> > > operation (i.e. guest VM), with changes and restrictions to better assure that
> > > no other software or hardware has direct visibility of the TD memory and state.
> > > 
> > > TDX transitions between TDX root operation and TDX non-root operation include TD
> > > Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> > > TDX root mode.  A TD Exit might be asynchronous, triggered by some external
> > > event (e.g., external interrupt or SMI) or an exception, or it might be
> > > synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
> > > 
> > > TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> > > of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> > > Domain Host. Those host-side TDX interface functions are categorized into
> > > various areas just for better organization, such as SYS (TDX module management),
> > > MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> > > etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
> > > 
> > > TDCS (Trust Domain Control Structure) is the main control structure of a guest
> > > TD, and is encrypted (using the guest TD's ephemeral private key).  At a high
> > > level, TDCS holds information for controlling TD operation as a whole:
> > > execution controls, EPTP, MSR bitmaps, etc. that KVM needs to set up.  Note
> > > that MSR bitmaps are held as part of TDCS (unlike VMX) because they are meant
> > > to have the same value for all VCPUs of the same TD.
> > > 
> > > Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> > > TD VCPU.  It helps the TDX module control the operation of the VCPU, and holds
> > > the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> > > DMA access, accessible only by using the TDX module interface functions (such as
> > > TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> > > such as virtual APIC page, virtualization exception information, etc.
> > > 
> > > Several VMX control structures (such as Shared EPT and Posted interrupt
> > > descriptor) are directly managed and accessed by the host VMM.  These control
> > > structures are pointed to by fields in the TD VMCS.
> > > 
> > > The above means that 1) KVM needs to allocate different data structures for TDs,
> > > 2) KVM can reuse the existing code for TDs for some operations, and 3) KVM needs
> > > to define TD-specific handling for the other operations and redirect them to the
> > > TDX specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
> > > vmx_callback();".
> > > 
> > > * TD Private Memory
> > > TD private memory is designed to hold TD private content, encrypted by the CPU
> > > using the TD ephemeral key. An encryption engine holds a table of encryption
> > > keys, and an encryption key is selected for each memory transaction based on a
> > > Host Key Identifier (HKID). By design, the host VMM does not have access to the
> > > encryption keys.
> > > 
> > > In the first generation of MKTME, HKID is "stolen" from the physical address by
> > > allocating a configurable number of bits from the top of the physical
> > > address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> > > accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> > > HKID on the host so that MKTME can be opaque or bypassed on the host.
> > > 
> > > During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> > > as either shared or private, based on the value of a new SHARED bit in the Guest
> > > Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
> > > (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> > > VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> > > with the current VMX. Since guest TDs usually require I/O, and the data exchange
> > > needs to be done via shared memory, KVM needs to use the current EPT
> > > functionality even for TDs.
> > > 
> > > * Secure EPT and Mirroring using the TDP code
> > > The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
> > > pages are encrypted and integrity-protected with the TD's ephemeral private
> > > key.  Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> > > interface functions, and thus conceptually Secure EPT is a subset of EPT.
> > > Since execution of such interface functions takes a much longer time than
> > > accessing memory directly, in KVM we use the existing TDP code to mirror the
> > > Secure EPT for the TD.
> > > 
> > > This way, we can effectively walk Secure EPT without using the TDX interface
> > > functions.
> > > 
> > > * VM life cycle and TDX specific operations
> > > The userspace VMM, such as QEMU, needs to build and treat TDs differently.  For
> > > example, a TD needs to boot in private memory, and the host software cannot copy
> > > the initial image to private memory.
> > > 
> > > * TSC Virtualization
> > > The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> > > (e.g. consistent among the TD VCPUs), and the virtual TSC frequency is determined
> > > by TD configuration, i.e. when the TD is created, not per VCPU.  KVM currently
> > > owns TSC virtualization for VMs, but the TDX module does so for TDs.
> > > 
> > > * MCE support for TDs
> > > The TDX module doesn't allow the VMM to inject MCE.  Instead, a PV way is
> > > needed for the TD to communicate with the VMM.  For now, KVM silently ignores
> > > MCE requests by the VMM.  MSRs related to MCE (e.g., MCE bank registers) can
> > > be naturally emulated by paravirtualizing MSR access.
> > > 
> > > For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
> > > available.
> > > 
> > > * Restrictions or future work
> > > Some features are not included to reduce patch size.  Those features will be
> > > addressed in future independent patch series.
> > > - large page (2M, 1G)
> > > - qemu gdb stub
> > > - guest PMU
> > > - and more
> > > 
> > > * Prerequisites
> > > It's required to load the TDX module and initialize it.  That is out of the scope
> > > of this patch series.  Another independent patch for the common x86 code is
> > > planned.  It defines CONFIG_INTEL_TDX_HOST, and this patch series uses
> > > CONFIG_INTEL_TDX_HOST.  It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX
> > > module is initialized, and the TDX module APIs for the TDX guest life cycle,
> > > like tdh.mng.init, are ready for KVM to use.
> > > 
> > > Concretely, global initialization, LP (Logical Processor) initialization,
> > > global configuration, the key configuration, and TDMR and PAMT initialization
> > > are done.  The state of the TDX module is SYS_READY.  Please refer to the TDX
> > > module specification, the chapter "Intel TDX Module Lifecycle State Machine".
> > > 
> > > ** Detecting the TDX module readiness
> > > The TDX host patch series implements the detection of the TDX module
> > > availability and its initialization so that KVM can use it.  It also manages
> > > the Host KeyID (HKID) assigned to a guest TD.
> > > The assumed APIs the TDX host patch series provides are (a usage sketch
> > > follows the list):
> > > - int seamrr_enabled()
> > >   Check if the required CPU feature (SEAM mode) is available.  This only
> > >   checks CPU feature availability.  At this point, the TDX module may not be
> > >   ready for KVM to use.
> > > - int init_tdx(void);
> > >   Initialize the TDX module so that it is ready for KVM to use.
> > > - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> > >   Return the system-wide information about the TDX module.  NULL if the TDX
> > >   module isn't initialized.
> > > - u32 tdx_get_global_keyid(void);
> > >   Return the global key ID that is used for the TDX module itself.
> > > - int tdx_keyid_alloc(void);
> > >   Allocate an HKID for a guest TD.
> > > - void tdx_keyid_free(int keyid);
> > >   Free the HKID of a guest TD.
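> > > 
> > > A rough sketch of how KVM might consume these assumed APIs on module load
> > > (illustrative only; error handling trimmed):
> > > 
> > >   static int __init tdx_module_setup(void)
> > >   {
> > >           const struct tdsysinfo_struct *info;
> > > 
> > >           if (!seamrr_enabled())
> > >                   return -ENODEV;  /* SEAM mode not available */
> > > 
> > >           if (init_tdx())
> > >                   return -EIO;     /* TDX module initialization failed */
> > > 
> > >           info = tdx_get_sysinfo();
> > >           if (!info)
> > >                   return -EIO;
> > > 
> > >           return 0;
> > >   }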
> > > 
> > > (****)
> > > * TDX KVM high-level design
> > > - Host key ID management
> > > Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> > > It is assumed that the TDX host patch series implements the necessary
> > > functions: u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
> > > void tdx_keyid_free(int keyid).
> > > 
> > > - Data structures and VM type
> > > Because TDX is different from VMX, define its own VM/VCPU structures, struct
> > > kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx.  To
> > > identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> > > TDX, is used.
> > > 
> > > - VM life cycle and TDX specific operations
> > > Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> > > New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> > > parameters, set initial guest memory and measurement.
> > > 
> > > The creation of a TDX VM requires five new operations on top of the
> > > conventional VM creation (a rough userspace flow is sketched after the list
> > > below):
> > >   - Get KVM system capability to check if TDX VM type is supported
> > >   - VM creation (KVM_CREATE_VM)
> > >   - New: Get the TDX specific system parameters.  KVM_TDX_GET_CAPABILITY.
> > >   - New: Set TDX specific VM parameters.  KVM_TDX_INIT_VM.
> > >   - VCPU creation (KVM_CREATE_VCPU)
> > >   - New: Set TDX specific VCPU parameters.  KVM_TDX_INIT_VCPU.
> > >   - New: Initialize guest memory as boot state and extend the measurement with
> > >     the memory.  KVM_TDX_INIT_MEM_REGION.
> > >   - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
> > >     TDX VM contents.
> > >   - VCPU run (KVM_RUN)
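> > > 
> > > For illustration, the expected userspace flow is roughly the following
> > > (pseudo-code; the *_cmd variables are made-up names for the TDX command
> > > structures passed to KVM_MEMORY_ENCRYPT_OP):
> > > 
> > >   vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, tdx_vm_type);
> > >   ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &get_capability_cmd);
> > >   ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &init_vm_cmd);
> > >   vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
> > >   ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &init_vcpu_cmd);
> > >   ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &init_mem_region_cmd);
> > >   ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &finalize_vm_cmd);
> > >   ioctl(vcpu_fd, KVM_RUN, 0);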
> > > 
> > > - Protected guest state
> > > Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> > > can't operate on it; for example, accessing CPU registers, injecting
> > > exceptions, and accessing guest memory.  Those operations are silently
> > > ignored, returning zero or the initial reset value when requested via
> > > KVM API ioctls.
> > > 
> > >     VM/VCPU state and callbacks for TDX specific operations.
> > >     Define TDX-specific VM state and VCPU state instead of the VMX ones.
> > >     Redirect operations to the TDX-specific callbacks.  "if (tdx) tdx_op()
> > >     else vmx_op()".
> > > 
> > >     Operations on the CPU state
> > >     Silently ignore operations on the guest state.  For example, a write to
> > >     CPU registers is ignored and a read from CPU registers returns 0.
> > > 
> > >     . ignore access to CPU registers except for allowed ones.
> > >     . TSC: add a check if the TSC is immutable and return an error, because
> > >       the KVM implementation updates the internal TSC state and it's
> > >       difficult to back out those changes.  Instead, skip the logic.
> > >     . dirty logging: add a check if dirty logging is supported.
> > >     . exceptions/SMI/MCE/SIPI/INIT: silently ignore
> > > 
> > >     Note: virtual external interrupt and NMI can be injected into TDX guests.
> > > 
> > > - KVM MMU integration
> > > One bit of the guest physical address (bit 51 or 47) is repurposed to indicate
> > > whether the guest physical address is private (the bit is cleared) or shared
> > > (the bit is set).  Such bits are called stolen bits.
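> > > 
> > > As a sketch, the private/shared check is a simple mask test
> > > (kvm_shared_gpa_bit() is a hypothetical helper returning 51 or 47 for the
> > > VM, depending on the EPT level):
> > > 
> > >   static bool is_private_gpa(struct kvm *kvm, gpa_t gpa)
> > >   {
> > >           gpa_t shared_bit = BIT_ULL(kvm_shared_gpa_bit(kvm));
> > > 
> > >           return !(gpa & shared_bit);
> > >   }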
> > > 
> > >   - Stolen bits framework
> > >     systematically tracks whether a guest physical address is used as shared
> > >     or private.
> > > 
> > >   - Shared EPT and Secure EPT
> > >     There are two EPTs: Shared EPT (the conventional one) and Secure
> > >     EPT (the new one).  Shared EPT handles GPAs with the stolen bit
> > >     set, the same as before.  Secure EPT points to private guest pages.
> > >     To resolve an EPT violation, KVM walks one of the two EPTs based on
> > >     the faulted GPA.  Because it's costly to walk the Secure EPT with
> > >     SEAMCALLs for a private guest physical address, another private
> > >     EPT is used as a mirror of the Secure EPT with the existing logic,
> > >     at the cost of extra memory.
> > > 
> > > The following depicts the relationship.
> > > 
> > >                     KVM                             |       TDX module
> > >                      |                              |           |
> > >         -------------+----------                    |           |
> > >         |                      |                    |           |
> > >         V                      V                    |           |
> > >      shared GPA           private GPA               |           |
> > >   CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
> > >         |                      |                    |           |
> > >         |                      |                    |           |
> > >         V                      V                    |           V
> > >   shared EPT                private EPT--------mirror----->Secure EPT
> > >         |                      |                    |           |
> > >         |                      \--------------------+------\    |
> > >         |                                           |      |    |
> > >         V                                           |      V    V
> > >   shared guest page                                 |    private guest page
> > >                                                     |
> > >                                                     |
> > >                               non-encrypted memory  |    encrypted memory
> > >                                                     |
> > > 
> > >   - Operating on Secure EPT
> > >     Use the TDX module APIs to operate on the Secure EPT.  To call the TDX
> > >     APIs while resolving an EPT violation, add hooks for the additional
> > >     operations and wire them to the TDX backend.
> > > 
> > > * References
> > > 
> > > [1] TDX specification
> > >    https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> > > [2] Intel Trust Domain Extensions (Intel TDX)
> > >    https://cdrdv2.intel.com/v1/dl/getContent/726790
> > > [3] Intel CPU Architectural Extensions Specification
> > >    https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> > > [4] Intel TDX Module 1.0 Specification
> > >    https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> > > [5] Intel TDX Loader Interface Specification
> > >   https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > > [6] Intel TDX Guest-Hypervisor Communication Interface
> > >    https://cdrdv2.intel.com/v1/dl/getContent/726790
> > > [7] Intel TDX Virtual Firmware Design Guide
> > >    https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
> > > [8] intel public github
> > >    kvm TDX branch: https://github.com/intel/tdx/tree/kvm
> > >    TDX guest branch: https://github.com/intel/tdx/tree/guest
> > >    qemu TDX https://github.com/intel/qemu-tdx
> > > [9] TDVF
> > >     https://github.com/tianocore/edk2-staging/tree/TDVF
> > >     This was merged into EDK2 main branch. https://github.com/tianocore/edk2
> > > 
> > > Chao Gao (3):
> > >   KVM: x86: Move check_processor_compatibility from init ops to runtime
> > >     ops
> > >   Partially revert "KVM: Pass kvm_init()'s opaque param to additional
> > >     arch funcs"
> > >   KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
> > >     wrmsr
> > > 
> > > Isaku Yamahata (72):
> > >   KVM: Refactor CPU compatibility check on module initialization
> > >   x86/virt/vmx/tdx: export platform_tdx_enabled()
> > >   KVM: TDX: Detect CPU feature on kernel module initialization
> > >   KVM: x86: Refactor KVM VMX module init/exit functions
> > >   KVM: TDX: Add placeholders for TDX VM/vcpu structure
> > >   x86/virt/tdx: Add a helper function to return system wide info about
> > >     TDX module
> > >   KVM: TDX: Initialize TDX module when loading kvm_intel.ko
> > >   KVM: TDX: Make TDX VM type supported
> > >   [MARKER] The start of TDX KVM patch series: TDX architectural
> > >     definitions
> > >   KVM: TDX: Define TDX architectural definitions
> > >   KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
> > >   KVM: TDX: Add helper functions to print TDX SEAMCALL error
> > >   [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
> > >   x86/cpu: Add helper functions to allocate/free TDX private host key id
> > >   KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
> > >   KVM: TDX: Make pmu_intel.c ignore guest TD case
> > >   [MARKER] The start of TDX KVM patch series: TD vcpu
> > >     creation/destruction
> > >   KVM: TDX: allocate/free TDX vcpu structure
> > >   KVM: TDX: allocate/free TDX vcpu structure
> > >   [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
> > >   KVM: x86/mmu: introduce config for PRIVATE KVM MMU
> > >   [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
> > >     TDX
> > >   KVM: x86/mmu: Disallow fast page fault on private GPA
> > >   KVM: VMX: Introduce test mode related to EPT violation VE
> > >   [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
> > >   KVM: x86/mmu: Forcibly use TDP MMU for TDX
> > >   KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
> > >   KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
> > >   KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
> > >   [MARKER] The start of TDX KVM patch series: TDX EPT violation
> > >   KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
> > >   KVM: TDX: don't request KVM_REQ_APIC_PAGE_RELOAD
> > >   KVM: TDX: TDP MMU TDX support
> > >   [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
> > >   KVM: x86/mmu: steal software usable bit to record if GFN is for shared
> > >     or not
> > >   KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
> > >   [MARKER] The start of TDX KVM patch series: TD finalization
> > >   KVM: TDX: Create initial guest memory
> > >   KVM: TDX: Finalize VM initialization
> > >   [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
> > >   KVM: TDX: Add helper assembly function to TDX vcpu
> > >   KVM: TDX: Implement TDX vcpu enter/exit path
> > >   KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
> > >   KVM: TDX: restore host xsave state when exit from the guest TD
> > >   KVM: TDX: restore user ret MSRs
> > >   [MARKER] The start of TDX KVM patch series: TD vcpu
> > >     exits/interrupts/hypercalls
> > >   KVM: TDX: complete interrupts after tdexit
> > >   KVM: TDX: restore debug store when TD exit
> > >   KVM: TDX: handle vcpu migration over logical processor
> > >   KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
> > >     behavior
> > >   KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
> > >   KVM: TDX: Implement interrupt injection
> > >   KVM: TDX: Implements vcpu request_immediate_exit
> > >   KVM: TDX: Implement methods to inject NMI
> > >   KVM: TDX: Add a place holder to handle TDX VM exit
> > >   KVM: TDX: handle EXIT_REASON_OTHER_SMI
> > >   KVM: TDX: handle ept violation/misconfig exit
> > >   KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
> > >   KVM: TDX: Add a place holder for handler of TDX hypercalls
> > >     (TDG.VP.VMCALL)
> > >   KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
> > >   KVM: TDX: Handle TDX PV CPUID hypercall
> > >   KVM: TDX: Handle TDX PV HLT hypercall
> > >   KVM: TDX: Handle TDX PV port io hypercall
> > >   KVM: TDX: Implement callbacks for MSR operations for TDX
> > >   KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
> > >   KVM: TDX: Handle TDX PV report fatal error hypercall
> > >   KVM: TDX: Handle TDX PV map_gpa hypercall
> > >   KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
> > >   KVM: TDX: Silently discard SMI request
> > >   KVM: TDX: Silently ignore INIT/SIPI
> > >   Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
> > >   KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> > > 
> > > Rick Edgecombe (1):
> > >   KVM: x86/mmu: Add address conversion functions for TDX shared bits
> > > 
> > > Sean Christopherson (25):
> > >   KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> > >   KVM: Enable hardware before doing arch VM initialization
> > >   KVM: x86: Introduce vm_type to differentiate default VMs from
> > >     confidential VMs
> > >   KVM: TDX: Add TDX "architectural" error codes
> > >   KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
> > >   KVM: TDX: create/destroy VM structure
> > >   KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
> > >   KVM: TDX: Do TDX specific vcpu initialization
> > >   KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
> > >   KVM: x86/mmu: Allow non-zero value for non-present SPTE
> > >   KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
> > >   KVM: x86/mmu: Allow per-VM override of the TDP max page level
> > >   KVM: x86/mmu: Zap only leaf SPTEs for deleted/moved memslot for
> > >     private mmu
> > >   KVM: x86/mmu: Disallow dirty logging for x86 TDX
> > >   KVM: VMX: Split out guts of EPT violation to common/exposed function
> > >   KVM: VMX: Move setting of EPT MMU masks to common VT-x code
> > >   KVM: TDX: Add load_mmu_pgd method for TDX
> > >   KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
> > >   KVM: TDX: Add support for find pending IRQ in a protected local APIC
> > >   KVM: x86: Assume timer IRQ was injected if APIC state is protected
> > >   KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> > >     argument
> > >   KVM: VMX: Move NMI/exception handler to common helper
> > >   KVM: x86: Split core of hypercall emulation to helper function
> > >   KVM: TDX: Handle TDX PV MMIO hypercall
> > >   KVM: TDX: Add methods to ignore accesses to CPU state
> > > 
> > > Xiaoyao Li (1):
> > >   KVM: TDX: initialize VM with TDX specific parameters
> > > 
> > >  Documentation/virt/kvm/api.rst                |   30 +-
> > >  .../virt/kvm/intel-tdx-layer-status.rst       |   33 +
> > >  Documentation/virt/kvm/intel-tdx.rst          |  381 +++
> > >  Documentation/virt/kvm/tdx-tdp-mmu.rst        |  466 ++++
> > >  arch/arm64/kvm/arm.c                          |    2 +-
> > >  arch/mips/kvm/mips.c                          |   14 +-
> > >  arch/powerpc/kvm/powerpc.c                    |    2 +-
> > >  arch/riscv/kvm/main.c                         |    2 +-
> > >  arch/s390/kvm/kvm-s390.c                      |    2 +-
> > >  arch/x86/events/intel/ds.c                    |    1 +
> > >  arch/x86/include/asm/kvm-x86-ops.h            |   10 +
> > >  arch/x86/include/asm/kvm_host.h               |   56 +-
> > >  arch/x86/include/asm/tdx.h                    |   67 +
> > >  arch/x86/include/asm/vmx.h                    |   14 +
> > >  arch/x86/include/uapi/asm/kvm.h               |   95 +
> > >  arch/x86/include/uapi/asm/vmx.h               |    5 +-
> > >  arch/x86/kvm/Kconfig                          |    4 +
> > >  arch/x86/kvm/Makefile                         |    3 +-
> > >  arch/x86/kvm/irq.c                            |    3 +
> > >  arch/x86/kvm/lapic.c                          |   37 +-
> > >  arch/x86/kvm/lapic.h                          |    2 +
> > >  arch/x86/kvm/mmu.h                            |   42 +-
> > >  arch/x86/kvm/mmu/mmu.c                        |  360 ++-
> > >  arch/x86/kvm/mmu/mmu_internal.h               |  123 +-
> > >  arch/x86/kvm/mmu/paging_tmpl.h                |    5 +-
> > >  arch/x86/kvm/mmu/spte.c                       |   46 +-
> > >  arch/x86/kvm/mmu/spte.h                       |   65 +-
> > >  arch/x86/kvm/mmu/tdp_iter.c                   |    1 +
> > >  arch/x86/kvm/mmu/tdp_iter.h                   |    5 +-
> > >  arch/x86/kvm/mmu/tdp_mmu.c                    |  690 ++++-
> > >  arch/x86/kvm/mmu/tdp_mmu.h                    |   12 +-
> > >  arch/x86/kvm/svm/svm.c                        |   13 +-
> > >  arch/x86/kvm/vmx/common.h                     |  174 ++
> > >  arch/x86/kvm/vmx/evmcs.c                      |    2 +-
> > >  arch/x86/kvm/vmx/evmcs.h                      |    2 +-
> > >  arch/x86/kvm/vmx/main.c                       | 1071 +++++++
> > >  arch/x86/kvm/vmx/pmu_intel.c                  |   39 +-
> > >  arch/x86/kvm/vmx/pmu_intel.h                  |   28 +
> > >  arch/x86/kvm/vmx/posted_intr.c                |   43 +-
> > >  arch/x86/kvm/vmx/posted_intr.h                |   13 +
> > >  arch/x86/kvm/vmx/tdx.c                        | 2465 +++++++++++++++++
> > >  arch/x86/kvm/vmx/tdx.h                        |  275 ++
> > >  arch/x86/kvm/vmx/tdx_arch.h                   |  157 ++
> > >  arch/x86/kvm/vmx/tdx_errno.h                  |   29 +
> > >  arch/x86/kvm/vmx/tdx_error.c                  |   22 +
> > >  arch/x86/kvm/vmx/tdx_ops.h                    |  188 ++
> > >  arch/x86/kvm/vmx/vmenter.S                    |  146 +
> > >  arch/x86/kvm/vmx/vmx.c                        |  737 ++---
> > >  arch/x86/kvm/vmx/vmx.h                        |   39 +-
> > >  arch/x86/kvm/vmx/x86_ops.h                    |  235 ++
> > >  arch/x86/kvm/x86.c                            |  148 +-
> > >  arch/x86/virt/vmx/tdx/seamcall.S              |    2 +
> > >  arch/x86/virt/vmx/tdx/tdx.c                   |   54 +-
> > >  arch/x86/virt/vmx/tdx/tdx.h                   |   52 -
> > >  include/linux/kvm_host.h                      |    4 +-
> > >  include/uapi/linux/kvm.h                      |    2 +
> > >  tools/arch/x86/include/uapi/asm/kvm.h         |   95 +
> > >  tools/include/uapi/linux/kvm.h                |    1 +
> > >  virt/kvm/kvm_main.c                           |   67 +-
> > >  59 files changed, 7877 insertions(+), 804 deletions(-)
> > >  create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
> > >  create mode 100644 Documentation/virt/kvm/intel-tdx.rst
> > >  create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
> > >  create mode 100644 arch/x86/kvm/vmx/common.h
> > >  create mode 100644 arch/x86/kvm/vmx/main.c
> > >  create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
> > >  create mode 100644 arch/x86/kvm/vmx/tdx.c
> > >  create mode 100644 arch/x86/kvm/vmx/tdx.h
> > >  create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
> > >  create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> > >  create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> > >  create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
> > >  create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> > > 
> > > -- 
> > > 2.25.1
> > > 
> > 
> > -- 
> > Isaku Yamahata <isaku.yamahata@gmail.com>
Chao Peng July 13, 2022, 7:37 a.m. UTC | #7
On Tue, Jul 12, 2022 at 10:22:50AM -0700, Isaku Yamahata wrote:
> On Tue, Jul 12, 2022 at 06:54:19PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > On Tue, Jul 12, 2022 at 01:07:20PM +0800, Chao Gao wrote:
> > > On Mon, Jul 11, 2022 at 08:17:01AM -0700, Isaku Yamahata wrote:
> > > >Hi. Because my description on large page support was terse, I wrote up more
> > > >detailed one.  Any feedback/thoughts on large page support?
> > > >
> > > >TDP MMU large page support design
> > > >
> > > >Two main discussion points
> > > >* how to track page status. private vs shared, no-largepage vs can-be-largepage
> > > 
> > > ...
> > > 
> > > >
> > > >Tracking private/shared and large page mappable
> > > >-----------------------------------------------
> > > >VMM needs to track that page is mapped as private or shared at 4KB granularity.
> > > >For efficiency of EPT violation path (****), at 2MB and 1GB level, VMM should
> > > >track the page can be mapped as a large page (regarding private/shared).  VMM
> > > >updates it on MapGPA and references it on the EPT violation path. (****)
> > > 
> > > Isaku,
> > > 
> > > + Peng Chao
> > > 
> > > Doesn't UPM guarantee that 2MB/1GB large page in CR3 should be either all
> > > private or all shared?
> > > 
> > > KVM always retrieves the mapping level in CR3 and enforces that EPT's
> > > page level is not greater than that in CR3. My point is if UPM already enforces
> > > no mixed pages in a large page, then KVM needn't do that again (UPM can
> > > be trusted).
> > 
> > The backing store in the UPM can tell KVM which page level it can
> > support for a given private gpa, similar to host_pfn_mapping_level() for
> > shared addresses.
> >
> > However, this solely represents the backing store's capability; KVM
> > still needs additional info to decide whether a range can be safely mapped
> > as 2M/1G, e.g. all the pages in the 2M/1G range should be private, and
> > currently this is not something the backing store can tell.
> 
> This argument applies to shared GPAs too.  The shared pages are backed by a
> normal file mapping with UPM.  When KVM is mapping a shared GPA, the same
> check is needed.  So I think KVM has to track all-private, all-shared, or
> no-largepage at the 2MB/1GB level.  If UPM tracks shared-or-private at the
> 4KB level, KVM probably doesn't need to track it at the 4KB level.

Right, the same also applies to shared memory.  All the info we need is
whether the pages of a 2M range are all private/shared and not mixed.  UPM v7
has code tracking that in KVM; in previous versions we tracked that in the
backing store, which has been discussed and deemed not a good idea.

Chao
> 
> 
> > Actually, in UPM v7 we let KVM record this info so one possible solution
> > is making use of it.
> > 
> >   https://lkml.org/lkml/2022/7/6/259
> > 
> > Then to map a page as 2M, KVM needs to check:
> >   - Memory backing store support that level
> >   - All pages in 2M range are private as we recorded through
> >     KVM_MEMORY_ENCRYPT_{UN,}REG_REGION
> >   - No existing partial 4K map(s) in 2M range
> -- 
> Isaku Yamahata <isaku.yamahata@gmail.com>
Sean Christopherson July 14, 2022, 1:03 a.m. UTC | #8
On Mon, Jun 27, 2022, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> KVM TDX basic feature support
> 
> Hello.  This is v7 the patch series vof KVM TDX support.
> This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
> The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
> How to run/test: It's describe at https://github.com/intel/tdx/wiki/TDX-KVM
> 
> Major changes from v6:
> - rebased to v5.19 base
> 
> TODO:
> - integrate fd-based guest memory. As the discussion is still on-going, I
>   intentionally dropped fd-based guest memory support yet.  The integration can
>   be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
> - 2M large page support. It's work-in-progress.
> For large page support, there are several design choices. Here is the design options.
> Any thoughts/feedback?

Apologies, I didn't read beyond the intro paragraph.  In case something like this
comes up again, it's probably best to send a standalone email tagged RFC, I doubt
I'm the only one that missed this embedded RFC.

> KVM MMU Large page support for TDX
 
...

> * options to track private or shared
> At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
> 1GB case). For 4KB each page, 1 bit per page is needed. private or shared.  For
> large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
> mixed).  When resolving KVM page fault, we don't want to check the lower-size
> pages to check if the given GPA can be a large for performance.  On MapGPA check
> it instead.
> 
> Option A). enhance kvm_arch_memory_slot
>   enum kvm_page_type {
>        KVM_PAGE_TYPE_INVALID,
>        KVM_PAGE_TYPE_SHARED,
>        KVM_PAGE_TYPE_PRIVATE,
>        KVM_PAGE_TYPE_MIXED,
>   };
> 
>   struct kvm_page_attr {
>        enum kvm_page_type type;
>   };
> 
>  struct kvm_arch_memory_slot {
>  +      struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
> 
> Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
> If !SPTE_MIXED_MASK, it can be large page.
> 
> Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
> kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
> 
> 
> * comparison
> A).
> + straightforward to implement
> + SPTE_SHARED_MASK isn't needed
> - memory overhead compared to B). or C).
> - more memory reference on KVM page fault
> 
> B).
> + simpler than C) (complex than A)?)
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - Waste precious SPTE bits.
> 
> C).
> + efficient on KVM page fault. (only SPTE reference)
> + low memory overhead
> - complicates MapGPA
> - scattered data structure

Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
on insertion/removal to (dis)allow hugepages as needed.

  + efficient on KVM page fault (no new lookups)
  + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
  + straightforward to implement
  + can (and should) be merged as part of the UPM series

I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
completely covered (fully shared) or not covered at all (fully private), but I'm
not 100% certain that xa_for_each_range() works the way I think it does.
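
Assuming it does (i.e. it visits only the present entries in [start, last]),
the check could be as simple as this untested sketch, counting present entries
in the range (with the UPM v7 convention of one xarray entry per private gfn):

	static bool range_is_mixed(struct kvm *kvm, gfn_t start, gfn_t end)
	{
		unsigned long index;
		unsigned long nr_private = 0;
		void *entry;

		/* Count only present (i.e. private) entries in [start, end). */
		xa_for_each_range(&kvm->mem_attr_array, index, entry, start, end - 1)
			nr_private++;

		/* 0 == fully shared, (end - start) == fully private, else mixed. */
		return nr_private && nr_private != end - start;
	}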
Xiaoyao Li July 14, 2022, 4:09 a.m. UTC | #9
On 7/14/2022 9:03 AM, Sean Christopherson wrote:
> On Mon, Jun 27, 2022, isaku.yamahata@intel.com wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> KVM TDX basic feature support
>>
>> Hello.  This is v7 the patch series vof KVM TDX support.
>> This is based on v5.19-rc1 + kvm/queue branch + TDX HOST patch series.
>> The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
>> How to run/test: It's describe at https://github.com/intel/tdx/wiki/TDX-KVM
>>
>> Major changes from v6:
>> - rebased to v5.19 base
>>
>> TODO:
>> - integrate fd-based guest memory. As the discussion is still on-going, I
>>    intentionally dropped fd-based guest memory support yet.  The integration can
>>    be found at https://github.com/intel/tdx/tree/kvm-upstream-workaround.
>> - 2M large page support. It's work-in-progress.
>> For large page support, there are several design choices. Here is the design options.
>> Any thoughts/feedback?
> 
> Apologies, I didn't read beyond the intro paragraph.  In case something like this
> comes up again, it's probably best to send a standalone email tagged RFC, I doubt
> I'm the only one that missed this embedded RFC.
> 
>> KVM MMU Large page support for TDX
>   
> ...
> 
>> * options to track private or shared
>> At each page size (4KB, 2MB, and 1GB), track private, shared, or mixed (2MB and
>> 1GB case). For 4KB each page, 1 bit per page is needed. private or shared.  For
>> large pages (2MB and 1GB), 2 bits per large page is needed. (private, shared, or
>> mixed).  When resolving KVM page fault, we don't want to check the lower-size
>> pages to check if the given GPA can be a large for performance.  On MapGPA check
>> it instead.
>>
>> Option A). enhance kvm_arch_memory_slot
>>    enum kvm_page_type {
>>         KVM_PAGE_TYPE_INVALID,
>>         KVM_PAGE_TYPE_SHARED,
>>         KVM_PAGE_TYPE_PRIVATE,
>>         KVM_PAGE_TYPE_MIXED,
>>    };
>>
>>    struct kvm_page_attr {
>>         enum kvm_page_type type;
>>    };
>>
>>   struct kvm_arch_memory_slot {
>>   +      struct kvm_page_attr *page_attr[KVM_NR_PAGE_SIZES];
>>
>> Option B). steal one more bit SPTE_MIXED_MASK in addition to SPTE_SHARED_MASK
>> If !SPTE_MIXED_MASK, it can be large page.

I don't think this is a good option, since it requires all the mappings
to exist all the time in both the shared SPTE tree and the private SPTE tree.

>> Option C). use SPTE_SHARED_MASK and kvm_mmu_page::mixed bitmap
>> kvm_mmu_page::mixed bitmap of 1GB, root indicates mixed for 2MB, 1GB.
>>
>>
>> * comparison
>> A).
>> + straightforward to implement
>> + SPTE_SHARED_MASK isn't needed
>> - memory overhead compared to B). or C).
>> - more memory reference on KVM page fault
>>
>> B).
>> + simpler than C) (complex than A)?)
>> + efficient on KVM page fault. (only SPTE reference)
>> + low memory overhead
>> - Waste precious SPTE bits.
>>
>> C).
>> + efficient on KVM page fault. (only SPTE reference)
>> + low memory overhead
>> - complicates MapGPA
>> - scattered data structure
> 
> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> on insertion/removal to (dis)allow hugepages as needed.

UPM v7[1] introduces "struct xarray mem_attr_array" to track the 
shared/private attr of a range.

So in kvm_vm_ioctl_set_encrypted_region() it needs to

- increase the lpage_info counter when a 2m/1g range changes from
identical to mixed, and

- decrease the counter when it changes from mixed back to identical

[1]: 
https://lore.kernel.org/all/20220706082016.2603916-12-chao.p.peng@linux.intel.com/

> 
>    + efficient on KVM page fault (no new lookups)
>    + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
>    + straightforward to implement
>    + can (and should) be merged as part of the UPM series
> 
> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> completely covered (fully shared) or not covered at all (fully private), but I'm
> not 100% certain that xa_for_each_range() works the way I think it does.
Chao Peng July 20, 2022, 2:59 p.m. UTC | #10
On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
...
> 
> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> on insertion/removal to (dis)allow hugepages as needed.
> 
>   + efficient on KVM page fault (no new lookups)
>   + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
>   + straightforward to implement
>   + can (and should) be merged as part of the UPM series
> 
> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> completely covered (fully shared) or not covered at all (fully private), but I'm
> not 100% certain that xa_for_each_range() works the way I think it does.

Hi Sean,

Below is the implementation to support 2M, as you mentioned in option D.
It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259

Everything sounds good; the only tricky bit is the inc/dec of disallow_lpage.
If we still treat it as a count, it will be a challenge to keep the inc/dec
balanced.  So in this patch I stole a bit for the purpose, which looks ugly.

Any feedback is welcome.

Thanks,
Chao

-----------------------------------------------------------------------
From: Chao Peng <chao.p.peng@linux.intel.com>
Date: Wed, 20 Jul 2022 11:37:18 +0800
Subject: [PATCH] KVM: Add large page support for private memory

Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.

Reserve a bit in disallow_lpage to indicate a large page has
private/share pages mixed.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |   8 +++
 arch/x86/kvm/mmu/mmu.c          | 120 +++++++++++++++++++++++++++++++-
 include/linux/kvm_host.h        |  14 ++++
 virt/kvm/kvm_main.c             |  12 +++-
 4 files changed, 150 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d460b8511041..b6ffe8b1c547 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
 #define __KVM_HAVE_ZAP_GFN_RANGE
+#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
 
 #define KVM_MAX_VCPUS 1024
 
@@ -935,6 +936,13 @@ struct kvm_vcpu_arch {
 #endif
 };
 
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits will be used as a reference count for other users.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED		(1U << 31)
+#define KVM_LPAGE_COUNT_MAX 			((1U << 31) - 1)
+
 struct kvm_lpage_info {
 	int disallow_lpage;
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 771ffd147e77..d040eeaf1f1c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -843,11 +843,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
 {
 	struct kvm_lpage_info *linfo;
 	int i;
+	int disallow_count;
 
 	for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
 		linfo = lpage_info_slot(gfn, slot, i);
+
+		disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+		WARN_ON(disallow_count + count < 0 ||
+			disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
 		linfo->disallow_lpage += count;
-		WARN_ON(linfo->disallow_lpage < 0);
 	}
 }
 
@@ -7246,3 +7251,116 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_lpage_recovery_thread)
 		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
 }
+
+static bool mem_attr_is_mixed(struct kvm *kvm, unsigned int attr,
+			      gfn_t start, gfn_t end)
+{
+	XA_STATE(xas, &kvm->mem_attr_array, start);
+	gfn_t gfn = start;
+	void *entry;
+	bool shared, private;
+	bool mixed = false;
+
+	if (attr == KVM_MEM_ATTR_SHARED) {
+		shared = true;
+		private = false;
+	} else {
+		shared = false;
+		private = true;
+	}
+
+	rcu_read_lock();
+	entry = xas_load(&xas);
+	while (gfn < end) {
+		if (xas_retry(&xas, entry))
+			continue;
+
+		KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+		if (entry)
+			private = true;
+		else
+			shared = true;
+
+		if (private && shared) {
+			mixed = true;
+			goto out;
+		}
+
+		entry = xas_next(&xas);
+		gfn++;
+	}
+out:
+	rcu_read_unlock();
+	return mixed;
+}
+
+static inline void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
+{
+	if (mixed)
+		linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+	else
+		linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void update_mem_lpage_info(struct kvm *kvm,
+				  struct kvm_memory_slot *slot,
+				  unsigned int attr,
+				  gfn_t start, gfn_t end)
+{
+	unsigned long lpage_start, lpage_end;
+	unsigned long gfn, pages, mask;
+	int level;
+
+	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		pages = KVM_PAGES_PER_HPAGE(level);
+		mask = ~(pages - 1);
+		lpage_start = start & mask;
+		lpage_end = end & mask;
+
+		/*
+		 * We only need to scan the head and tail page, for middle pages
+		 * we know they are not mixed.
+		 */
+		update_mixed(lpage_info_slot(lpage_start, slot, level),
+			     mem_attr_is_mixed(kvm, attr, lpage_start,
+							  lpage_start + pages));
+
+		if (lpage_start == lpage_end)
+			return;
+
+		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
+			update_mixed(lpage_info_slot(gfn, slot, level), false);
+		}
+
+		update_mixed(lpage_info_slot(lpage_end, slot, level),
+			     mem_attr_is_mixed(kvm, attr, lpage_end,
+							  lpage_end + pages));
+	}
+}
+
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+			      gfn_t start, gfn_t end)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	int i;
+
+	WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
+			"Unsupported mem attribute.\n");
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(kvm, i);
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+			slot = iter.slot;
+			start = max(start, slot->base_gfn);
+			end = min(end, slot->base_gfn + slot->npages);
+			if (WARN_ON_ONCE(start >= end))
+				continue;
+
+			update_mem_lpage_info(kvm, slot, attr, start, end);
+		}
+	}
+}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d45f00f5b3ee..7b18fcd71df5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2282,6 +2282,10 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
 #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+
+#define KVM_MEM_ATTR_SHARED	0x0001
+#define KVM_MEM_ATTR_PRIVATE	0x0002
+
 static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
 					  gfn_t gfn, kvm_pfn_t *pfn, int *order)
 {
@@ -2307,6 +2311,16 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 	return !!xa_load(&kvm->mem_attr_array, gfn);
 }
 
+#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+			      gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+					    gfn_t start, gfn_t end)
+{
+}
+#endif
+
 #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1ba4b9e5449c..1d22c8603f91 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -863,12 +863,12 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
 #ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
-#define KVM_MEM_ATTR_PRIVATE	0x0001
 static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
 					     struct kvm_enc_region *region)
 {
 	unsigned long start, end;
 	void *entry;
+	int attr;
 	int r;
 
 	if (region->size == 0 || region->addr + region->size < region->addr)
@@ -879,13 +879,19 @@ static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl
 	start = region->addr >> PAGE_SHIFT;
 	end = (region->addr + region->size - 1) >> PAGE_SHIFT;
 
-	entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
-				xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
+	if (ioctl == KVM_MEMORY_ENCRYPT_REG_REGION) {
+		attr = KVM_MEM_ATTR_PRIVATE;
+		entry = xa_mk_value(KVM_MEM_ATTR_PRIVATE);
+	} else {
+		attr = KVM_MEM_ATTR_SHARED;
+		entry = NULL;
+	}
 
 	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
 					entry, GFP_KERNEL_ACCOUNT));
 
 	kvm_zap_gfn_range(kvm, start, end + 1);
+	kvm_arch_update_mem_attr(kvm, attr, start, end + 1);
 
 	return r;
 }
Nikunj A. Dadhania July 25, 2022, 1:46 p.m. UTC | #11
On 7/20/2022 8:29 PM, Chao Peng wrote:
> On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
> ...
>>
>> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
>> on insertion/removal to (dis)allow hugepages as needed.
>>
>>   + efficient on KVM page fault (no new lookups)
>>   + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
>>   + straightforward to implement
>>   + can (and should) be merged as part of the UPM series
>>
>> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
>> completely covered (fully shared) or not covered at all (fully private), but I'm
>> not 100% certain that xa_for_each_range() works the way I think it does.
> 
> Hi Sean,
> 
> Below is the implementation to support 2M as you mentioned as option D.
> It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
> 
> Everything sounds good, the only trick bit is inc/dec disallow_lpage. If
> we still treat it as a count, it will be a challenge to make the inc/dec
> balanced. So in this patch I stole a bit for the purpose, looks ugly.
> 
> Any feedback is welcome.
> 
> Thanks,
> Chao
> 
> -----------------------------------------------------------------------
> From: Chao Peng <chao.p.peng@linux.intel.com>
> Date: Wed, 20 Jul 2022 11:37:18 +0800
> Subject: [PATCH] KVM: Add large page support for private memory
> 
> Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> 
> Reserve a bit in disallow_lpage to indicate a large page has
> private/share pages mixed.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---


> +static void update_mem_lpage_info(struct kvm *kvm,
> +				  struct kvm_memory_slot *slot,
> +				  unsigned int attr,
> +				  gfn_t start, gfn_t end)
> +{
> +	unsigned long lpage_start, lpage_end;
> +	unsigned long gfn, pages, mask;
> +	int level;
> +
> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +		pages = KVM_PAGES_PER_HPAGE(level);
> +		mask = ~(pages - 1);
> +		lpage_start = start & mask;
> +		lpage_end = end & mask;
> +
> +		/*
> +		 * We only need to scan the head and tail page, for middle pages
> +		 * we know they are not mixed.
> +		 */
> +		update_mixed(lpage_info_slot(lpage_start, slot, level),
> +			     mem_attr_is_mixed(kvm, attr, lpage_start,
> +							  lpage_start + pages));
> +
> +		if (lpage_start == lpage_end)
> +			return;
> +
> +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
> +			update_mixed(lpage_info_slot(gfn, slot, level), false);
> +		}

Boundary check missing here for the case when gfn reaches lpage_end.

		if (gfn == lpage_end)
			return;

> +
> +		update_mixed(lpage_info_slot(lpage_end, slot, level),
> +			     mem_attr_is_mixed(kvm, attr, lpage_end,
> +							  lpage_end + pages));
> +	}
> +}

Regards
Nikunj
Chao Peng July 26, 2022, 2:32 p.m. UTC | #12
On Mon, Jul 25, 2022 at 07:16:24PM +0530, Nikunj A. Dadhania wrote:
> On 7/20/2022 8:29 PM, Chao Peng wrote:
> > On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
> > ...
> >>
> >> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> >> on insertion/removal to (dis)allow hugepages as needed.
> >>
> >>   + efficient on KVM page fault (no new lookups)
> >>   + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
> >>   + straightforward to implement
> >>   + can (and should) be merged as part of the UPM series
> >>
> >> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> >> completely covered (fully shared) or not covered at all (fully private), but I'm
> >> not 100% certain that xa_for_each_range() works the way I think it does.
> > 
> > Hi Sean,
> > 
> > Below is the implementation to support 2M as you mentioned as option D.
> > It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
> > 
> > Everything sounds good, the only trick bit is inc/dec disallow_lpage. If
> > we still treat it as a count, it will be a challenge to make the inc/dec
> > balanced. So in this patch I stole a bit for the purpose, looks ugly.
> > 
> > Any feedback is welcome.
> > 
> > Thanks,
> > Chao
> > 
> > -----------------------------------------------------------------------
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> > Date: Wed, 20 Jul 2022 11:37:18 +0800
> > Subject: [PATCH] KVM: Add large page support for private memory
> > 
> > Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> > 
> > Reserve a bit in disallow_lpage to indicate a large page has
> > private/share pages mixed.
> > 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> 
> 
> > +static void update_mem_lpage_info(struct kvm *kvm,
> > +				  struct kvm_memory_slot *slot,
> > +				  unsigned int attr,
> > +				  gfn_t start, gfn_t end)
> > +{
> > +	unsigned long lpage_start, lpage_end;
> > +	unsigned long gfn, pages, mask;
> > +	int level;
> > +
> > +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +		pages = KVM_PAGES_PER_HPAGE(level);
> > +		mask = ~(pages - 1);
> > +		lpage_start = start & mask;
> > +		lpage_end = end & mask;
> > +
> > +		/*
> > +		 * We only need to scan the head and tail page, for middle pages
> > +		 * we know they are not mixed.
> > +		 */
> > +		update_mixed(lpage_info_slot(lpage_start, slot, level),
> > +			     mem_attr_is_mixed(kvm, attr, lpage_start,
> > +							  lpage_start + pages));
> > +
> > +		if (lpage_start == lpage_end)
> > +			return;
> > +
> > +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
> > +			update_mixed(lpage_info_slot(gfn, slot, level), false);
> > +		}
> 
> Boundary check missing here for the case when gfn reaches lpage_end.
> 
> 		if (gfn == lpage_end)
> 			return;

In this case, it's actually the tail page that I want to scan with the
code below.

It's also possible I misunderstand something here.

Chao
> 
> > +
> > +		update_mixed(lpage_info_slot(lpage_end, slot, level),
> > +			     mem_attr_is_mixed(kvm, attr, lpage_end,
> > +							  lpage_end + pages));
> > +	}
> > +}
> 
> Regards
> Nikunj
Nikunj A. Dadhania July 27, 2022, 9:26 a.m. UTC | #13
On 7/26/2022 8:02 PM, Chao Peng wrote:
> On Mon, Jul 25, 2022 at 07:16:24PM +0530, Nikunj A. Dadhania wrote:
>> On 7/20/2022 8:29 PM, Chao Peng wrote:
>>> On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
>>> ...
>>>>
>>>> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
>>>> on insertion/removal to (dis)allow hugepages as needed.
>>>>
>>>>   + efficient on KVM page fault (no new lookups)
>>>>   + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
>>>>   + straightforward to implement
>>>>   + can (and should) be merged as part of the UPM series
>>>>
>>>> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
>>>> completely covered (fully shared) or not covered at all (fully private), but I'm
>>>> not 100% certain that xa_for_each_range() works the way I think it does.
>>>
>>> Hi Sean,
>>>
>>> Below is the implementation to support 2M as you mentioned as option D.
>>> It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
>>>
>>> Everything sounds good, the only trick bit is inc/dec disallow_lpage. If
>>> we still treat it as a count, it will be a challenge to make the inc/dec
>>> balanced. So in this patch I stole a bit for the purpose, looks ugly.
>>>
>>> Any feedback is welcome.
>>>
>>> Thanks,
>>> Chao
>>>
>>> -----------------------------------------------------------------------
>>> From: Chao Peng <chao.p.peng@linux.intel.com>
>>> Date: Wed, 20 Jul 2022 11:37:18 +0800
>>> Subject: [PATCH] KVM: Add large page support for private memory
>>>
>>> Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
>>>
>>> Reserve a bit in disallow_lpage to indicate a large page has
>>> private/share pages mixed.
>>>
>>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>>> ---
>>
>>
>>> +static void update_mem_lpage_info(struct kvm *kvm,
>>> +				  struct kvm_memory_slot *slot,
>>> +				  unsigned int attr,
>>> +				  gfn_t start, gfn_t end)
>>> +{
>>> +	unsigned long lpage_start, lpage_end;
>>> +	unsigned long gfn, pages, mask;
>>> +	int level;
>>> +
>>> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
>>> +		pages = KVM_PAGES_PER_HPAGE(level);
>>> +		mask = ~(pages - 1);
>>> +		lpage_start = start & mask;
>>> +		lpage_end = end & mask;
>>> +
>>> +		/*
>>> +		 * We only need to scan the head and tail page, for middle pages
>>> +		 * we know they are not mixed.
>>> +		 */
>>> +		update_mixed(lpage_info_slot(lpage_start, slot, level),
>>> +			     mem_attr_is_mixed(kvm, attr, lpage_start,
>>> +							  lpage_start + pages));
>>> +
>>> +		if (lpage_start == lpage_end)
>>> +			return;
>>> +
>>> +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
>>> +			update_mixed(lpage_info_slot(gfn, slot, level), false);
>>> +		}
>>
>> Boundary check missing here for the case when gfn reaches lpage_end.
>>
>> 		if (gfn == lpage_end)
>> 			return;
> 
> In this case, it's actually the tail page that I want to scan for with
> below code.

What if you do not have the tail lpage?

For example: memslot base_gfn = 0x1000 and npages is 0x800, so memslot range
is 0x1000 to 0x17ff.

Assume a case when this function is called with start = 0x1000 and end = 0x1800.
For 2M, the page mask is 0x1ff.  start and end both are 2M aligned.

The first update_mixed() takes care of 0x1000-0x1200.
The loop update_mixed() goes over 0x1200-0x1800, so there are no pages left
for the last update_mixed() to process.

> 
> It's also possible I misunderstand something here.
> 
> Chao
>>
>>> +
>>> +		update_mixed(lpage_info_slot(lpage_end, slot, level),
>>> +			     mem_attr_is_mixed(kvm, attr, lpage_end,
>>> +							  lpage_end + pages));

lpage_info_slot() sometimes causes a crash, as I noticed that
lpage_info_slot() returns an out-of-bound index.

Regards
Nikunj
Chao Peng Aug. 3, 2022, 10:48 a.m. UTC | #14
On Wed, Jul 27, 2022 at 02:56:40PM +0530, Nikunj A. Dadhania wrote:
> On 7/26/2022 8:02 PM, Chao Peng wrote:
> > On Mon, Jul 25, 2022 at 07:16:24PM +0530, Nikunj A. Dadhania wrote:
> >> On 7/20/2022 8:29 PM, Chao Peng wrote:
> >>> On Thu, Jul 14, 2022 at 01:03:46AM +0000, Sean Christopherson wrote:
> >>> ...
> >>>>
> >>>> Option D). track shared regions in an Xarray, update kvm_arch_memory_slot.lpage_info
> >>>> on insertion/removal to (dis)allow hugepages as needed.
> >>>>
> >>>>   + efficient on KVM page fault (no new lookups)
> >>>>   + zero memory overhead (assuming KVM has to eat the cost of the Xarray anyways)
> >>>>   + straightforward to implement
> >>>>   + can (and should) be merged as part of the UPM series
> >>>>
> >>>> I believe xa_for_each_range() can be used to see if a given 2mb/1gb range is
> >>>> completely covered (fully shared) or not covered at all (fully private), but I'm
> >>>> not 100% certain that xa_for_each_range() works the way I think it does.
> >>>
> >>> Hi Sean,
> >>>
> >>> Below is the implementation to support 2M as you mentioned as option D.
> >>> It's based on UPM v7 xarray code: https://lkml.org/lkml/2022/7/6/259
> >>>
> >>> Everything sounds good, the only trick bit is inc/dec disallow_lpage. If
> >>> we still treat it as a count, it will be a challenge to make the inc/dec
> >>> balanced. So in this patch I stole a bit for the purpose, looks ugly.
> >>>
> >>> Any feedback is welcome.
> >>>
> >>> Thanks,
> >>> Chao
> >>>
> >>> -----------------------------------------------------------------------
> >>> From: Chao Peng <chao.p.peng@linux.intel.com>
> >>> Date: Wed, 20 Jul 2022 11:37:18 +0800
> >>> Subject: [PATCH] KVM: Add large page support for private memory
> >>>
> >>> Update lpage_info when handling KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
> >>>
> >>> Reserve a bit in disallow_lpage to indicate a large page has
> >>> private/share pages mixed.
> >>>
> >>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >>> ---
> >>
> >>
> >>> +static void update_mem_lpage_info(struct kvm *kvm,
> >>> +				  struct kvm_memory_slot *slot,
> >>> +				  unsigned int attr,
> >>> +				  gfn_t start, gfn_t end)
> >>> +{
> >>> +	unsigned long lpage_start, lpage_end;
> >>> +	unsigned long gfn, pages, mask;
> >>> +	int level;
> >>> +
> >>> +	for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> >>> +		pages = KVM_PAGES_PER_HPAGE(level);
> >>> +		mask = ~(pages - 1);
> >>> +		lpage_start = start & mask;
> >>> +		lpage_end = end & mask;
> >>> +
> >>> +		/*
> >>> +		 * We only need to scan the head and tail page, for middle pages
> >>> +		 * we know they are not mixed.
> >>> +		 */
> >>> +		update_mixed(lpage_info_slot(lpage_start, slot, level),
> >>> +			     mem_attr_is_mixed(kvm, attr, lpage_start,
> >>> +							  lpage_start + pages));
> >>> +
> >>> +		if (lpage_start == lpage_end)
> >>> +			return;
> >>> +
> >>> +		for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
> >>> +			update_mixed(lpage_info_slot(gfn, slot, level), false);
> >>> +		}
> >>
> >> Boundary check missing here for the case when gfn reaches lpage_end.
> >>
> >> 		if (gfn == lpage_end)
> >> 			return;
> > 
> > In this case, it's actually the tail page that I want to scan for with
> > below code.
> 
> What if you do not have the tail lpage?
> 
> For example: memslot base_gfn = 0x1000 and npages is 0x800, so memslot range
> is 0x1000 to 0x17ff.
> 
> Assume a case when this function is called with start = 1000 and end = 1800.
> For 2M, page mask is 0x1ff. start and end both are 2M aligned.
> 
> First update_mixed takes care of 0x1000-0x1200
> Loop update_mixed: goes over from 0x1200 - 0x1800, there are no pages left
> for last update_mixed to process.

Oops, good catch. I would fix it differently by playing with lpage_end:
	lpage_end = (end - 1) & mask;
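
With Nikunj's example (start = 0x1000, end = 0x1800, 2M pages = 0x200), that
gives lpage_end = 0x1600, so the middle loop stops before 0x1600 and the final
update_mixed() scans the in-slot tail range 0x1600-0x1800.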

Thanks,
Chao

> 
> > 
> > It's also possible I misunderstand something here.
> > 
> > Chao
> >>
> >>> +
> >>> +		update_mixed(lpage_info_slot(lpage_end, slot, level),
> >>> +			     mem_attr_is_mixed(kvm, attr, lpage_end,
> >>> +							  lpage_end + pages));
> 
> lpage_info_slot some times causes a crash, as I noticed that
> lpage_info_slot() returns out-of-bound index.
> 
> Regards
> Nikunj
>