Message ID | 97bb1f2996d8a7b828cd9e3309380d1a86ca681b.1705965635.git.isaku.yamahata@intel.com
---|---
State | New, archived
Series | KVM TDX basic feature support
On Mon, Jan 22, 2024, isaku.yamahata@intel.com wrote: > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 4cbcedff4f16..1a5a91b99de9 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -591,6 +591,69 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > return 0; > } > > +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, > + enum pg_level level, kvm_pfn_t pfn) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + hpa_t hpa = pfn_to_hpa(pfn); > + gpa_t gpa = gfn_to_gpa(gfn); > + struct tdx_module_args out; > + hpa_t source_pa; > + bool measure; > + u64 err; > + int i; > + > + /* > + * KVM_INIT_MEM_REGION, tdx_init_mem_region(), supports only 4K page > + * because tdh_mem_page_add() supports only 4K page. > + */ > + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) > + return -EINVAL; > + > + /* > + * In case of TDP MMU, fault handler can run concurrently. Note > + * 'source_pa' is a TD scope variable, meaning if there are multiple > + * threads reaching here with all needing to access 'source_pa', it > + * will break. However fortunately this won't happen, because below > + * TDH_MEM_PAGE_ADD code path is only used when VM is being created > + * before it is running, using KVM_TDX_INIT_MEM_REGION ioctl (which > + * always uses vcpu 0's page table and protected by vcpu->mutex). > + */ Most of the above is superflous. tdx_mem_page_add() is called if and only if the TD is finalized, and the TDX module disallow running vCPUs before the TD is finalized. That's it. And maybe throw in a lockdep to assert that kvm->lock is held. > + if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) { > + tdx_unpin(kvm, pfn); > + return -EINVAL; > + } > + > + source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION; > + measure = kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION; > + kvm_tdx->source_pa = INVALID_PAGE; > + > + do { > + err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa, > + &out); > + /* > + * This path is executed during populating initial guest memory > + * image. i.e. before running any vcpu. Race is rare. How are races possible at all? > + */ > + } while (unlikely(err == TDX_ERROR_SEPT_BUSY)); > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out); > + tdx_unpin(kvm, pfn); > + return -EIO; > + } else if (measure) { > + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > + err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out); > + if (KVM_BUG_ON(err, &kvm_tdx->kvm)) { > + pr_tdx_error(TDH_MR_EXTEND, err, &out); > + break; > + } > + } Why is measurement done deep within the MMU? At a glance, I don't see why this can't be done up in the ioctl, outside of a spinlock. And IIRC, the order affects the measurement but doesn't truly matter, e.g. KVM could choose to completely separate tdh_mr_extend() from tdh_mem_page_add(), no? > +static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + struct kvm_tdx_init_mem_region region; > + struct kvm_vcpu *vcpu; > + struct page *page; > + int idx, ret = 0; > + bool added = false; > + > + /* Once TD is finalized, the initial guest memory is fixed. */ > + if (is_td_finalized(kvm_tdx)) > + return -EINVAL; > + > + /* The BSP vCPU must be created before initializing memory regions. 
*/ > + if (!atomic_read(&kvm->online_vcpus)) > + return -EINVAL; > + > + if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION) > + return -EINVAL; > + > + if (copy_from_user(®ion, (void __user *)cmd->data, sizeof(region))) > + return -EFAULT; > + > + /* Sanity check */ > + if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) || > + !IS_ALIGNED(region.gpa, PAGE_SIZE) || > + !region.nr_pages || > + region.nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) || > + region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa || > + !kvm_is_private_gpa(kvm, region.gpa) || > + !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT))) > + return -EINVAL; > + > + vcpu = kvm_get_vcpu(kvm, 0); > + if (mutex_lock_killable(&vcpu->mutex)) > + return -EINTR; The real reason for this drive-by pseudo-review is that I am hoping/wishing we can turn this into a generic KVM ioctl() to allow userspace to pre-map guest memory[*]. If we're going to carry non-trivial code, we might as well squeeze as much use out of it as we can. Beyond wanting to shove this into KVM_MEMORY_ENCRYPT_OP, is there any reason why this is a VM ioctl() and not a vCPU ioctl()? Very roughly, couldn't we use a struct like this as input to a vCPU ioctl() that maps memory, and optionally initializes memory from @source? struct kvm_memory_mapping { __u64 base_gfn; __u64 nr_pages; __u64 flags; __u64 source; } TDX would need to do special things for copying the source, but beyond that most of the code in this function is generic. [*] https://lore.kernel.org/all/65262e67-7885-971a-896d-ad9c0a760907@polito.it
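As a concrete reading of the suggestion above, the comment and check could collapse to something like the sketch below. It reuses the names from the patch (to_kvm_tdx(), tdx_unpin(), source_pa, KVM_BUG_ON()); the lockdep assertion assumes kvm->lock really is held on the KVM_TDX_INIT_MEM_REGION path, which is implied above but not shown in this hunk:

	static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn,
				    enum pg_level level, kvm_pfn_t pfn)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
		...
		/*
		 * tdx_mem_page_add() is called iff the TD is not finalized, and
		 * the TDX module disallows running vCPUs before finalization, so
		 * KVM_TDX_INIT_MEM_REGION is the only path that can reach this
		 * point and 'source_pa' cannot be accessed concurrently.
		 */
		lockdep_assert_held(&kvm->lock);

		if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
			tdx_unpin(kvm, pfn);
			return -EINVAL;
		}
		...
	}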
+Vipin Sharma On Wed, Jan 31, 2024 at 4:21 PM Sean Christopherson <seanjc@google.com> wrote: > On Mon, Jan 22, 2024, isaku.yamahata@intel.com wrote: > > The real reason for this drive-by pseudo-review is that I am hoping/wishing we > can turn this into a generic KVM ioctl() to allow userspace to pre-map guest > memory[*]. > > If we're going to carry non-trivial code, we might as well squeeze as much use > out of it as we can. > > Beyond wanting to shove this into KVM_MEMORY_ENCRYPT_OP, is there any reason why > this is a VM ioctl() and not a vCPU ioctl()? Very roughly, couldn't we use a > struct like this as input to a vCPU ioctl() that maps memory, and optionally > initializes memory from @source? > > struct kvm_memory_mapping { > __u64 base_gfn; > __u64 nr_pages; > __u64 flags; > __u64 source; > } > > TDX would need to do special things for copying the source, but beyond that most > of the code in this function is generic. > > [*] https://lore.kernel.org/all/65262e67-7885-971a-896d-ad9c0a760907@polito.it We would also be interested in such an API to reduce the guest performance impact of intra-host migration.
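To make "most of the code in this function is generic" concrete: stripped of the TDX-specific source_pa handling, the population loop in tdx_init_mem_region() boils down to roughly the sketch below, which is the part a generic pre-mapping vCPU ioctl could share (and what would run after intra-host migration to pre-fault guest memory). The function name kvm_vcpu_map_memory() is hypothetical; the sketch assumes the kvm_mmu_map_tdp_page() helper introduced by this series stays available, and the exact fault error-code bits are left as policy:

	/* Hypothetical generic core, distilled from tdx_init_mem_region(). */
	static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
				       struct kvm_memory_mapping *map)
	{
		u64 error_code = PFERR_WRITE_MASK;	/* which bits to fault with is TBD */
		int ret = 0;

		kvm_mmu_reload(vcpu);

		while (map->nr_pages) {
			if (signal_pending(current))
				return -ERESTARTSYS;
			cond_resched();

			/* Vendor code (e.g. TDX copying from map->source) would hook in here. */
			ret = kvm_mmu_map_tdp_page(vcpu, gfn_to_gpa(map->base_gfn),
						   error_code, PG_LEVEL_4K);
			if (ret)
				break;

			map->base_gfn++;
			map->nr_pages--;
		}

		return ret;
	}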
On 1/23/2024 7:53 AM, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > > Because the guest memory is protected in TDX, the creation of the initial > guest memory requires a dedicated TDX module API, tdh_mem_page_add, instead > of directly copying the memory contents into the guest memory in the case > of the default VM type. KVM MMU page fault handler callback, > private_page_add, handles it. The changelog is stale? Do you mean "set_private_spte"? > > Define new subcommand, KVM_TDX_INIT_MEM_REGION, of VM-scoped > KVM_MEMORY_ENCRYPT_OP. It assigns the guest page, copies the initial > memory contents into the guest memory, encrypts the guest memory. At the > same time, optionally it extends memory measurement of the TDX guest. It > calls the KVM MMU page fault(EPT-violation) handler to trigger the > callbacks for it. > > Reported-by: gkirkpatrick@google.com > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > > --- > v18: > - rename tdx_sept_page_add() -> tdx_mem_page_add(). > - open code tdx_measure_page() into tdx_mem_page_add(). > - remove the change of tools/arch/x86/include/uapi/asm/kvm.h. > > v15 -> v16: > - add check if nr_pages isn't large with > (nr_page << PAGE_SHIFT) >> PAGE_SHIFT > > v14 -> v15: > - add a check if TD is finalized or not to tdx_init_mem_region() > - return -EAGAIN when partial population > --- > arch/x86/include/uapi/asm/kvm.h | 9 ++ > arch/x86/kvm/mmu/mmu.c | 1 + > arch/x86/kvm/vmx/tdx.c | 160 +++++++++++++++++++++++++++++++- > arch/x86/kvm/vmx/tdx.h | 2 + > 4 files changed, 169 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h > index 4000a2e087a8..9fda7c90b7b5 100644 > --- a/arch/x86/include/uapi/asm/kvm.h > +++ b/arch/x86/include/uapi/asm/kvm.h > @@ -572,6 +572,7 @@ enum kvm_tdx_cmd_id { > KVM_TDX_CAPABILITIES = 0, > KVM_TDX_INIT_VM, > KVM_TDX_INIT_VCPU, > + KVM_TDX_INIT_MEM_REGION, > > KVM_TDX_CMD_NR_MAX, > }; > @@ -649,4 +650,12 @@ struct kvm_tdx_init_vm { > struct kvm_cpuid2 cpuid; > }; > > +#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0) > + > +struct kvm_tdx_init_mem_region { > + __u64 source_addr; > + __u64 gpa; > + __u64 nr_pages; > +}; > + > #endif /* _ASM_X86_KVM_H */ > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 26d215e85b76..fc258f112e73 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -5663,6 +5663,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) > out: > return r; > } > +EXPORT_SYMBOL(kvm_mmu_load); > > void kvm_mmu_unload(struct kvm_vcpu *vcpu) > { > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > index 4cbcedff4f16..1a5a91b99de9 100644 > --- a/arch/x86/kvm/vmx/tdx.c > +++ b/arch/x86/kvm/vmx/tdx.c > @@ -591,6 +591,69 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, > return 0; > } > > +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, > + enum pg_level level, kvm_pfn_t pfn) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + hpa_t hpa = pfn_to_hpa(pfn); > + gpa_t gpa = gfn_to_gpa(gfn); > + struct tdx_module_args out; > + hpa_t source_pa; > + bool measure; > + u64 err; > + int i; > + > + /* > + * KVM_INIT_MEM_REGION, tdx_init_mem_region(), supports only 4K page > + * because tdh_mem_page_add() supports only 4K page. > + */ > + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) > + return -EINVAL; > + > + /* > + * In case of TDP MMU, fault handler can run concurrently. 
Note > + * 'source_pa' is a TD scope variable, meaning if there are multiple > + * threads reaching here with all needing to access 'source_pa', it > + * will break. However fortunately this won't happen, because below > + * TDH_MEM_PAGE_ADD code path is only used when VM is being created > + * before it is running, using KVM_TDX_INIT_MEM_REGION ioctl (which > + * always uses vcpu 0's page table and protected by vcpu->mutex). > + */ > + if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) { > + tdx_unpin(kvm, pfn); > + return -EINVAL; > + } > + > + source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION; > + measure = kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION; > + kvm_tdx->source_pa = INVALID_PAGE; > + > + do { > + err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa, > + &out); > + /* > + * This path is executed during populating initial guest memory > + * image. i.e. before running any vcpu. Race is rare. > + */ > + } while (unlikely(err == TDX_ERROR_SEPT_BUSY)); For page add, since pages are added one by one, there should be no such error, right? > + if (KVM_BUG_ON(err, kvm)) { > + pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out); > + tdx_unpin(kvm, pfn); > + return -EIO; > + } else if (measure) { > + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { > + err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out); > + if (KVM_BUG_ON(err, &kvm_tdx->kvm)) { > + pr_tdx_error(TDH_MR_EXTEND, err, &out); > + break; > + } > + } > + } > + > + return 0; > + > +} > + > static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > enum pg_level level, kvm_pfn_t pfn) > { > @@ -613,9 +676,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, > if (likely(is_td_finalized(kvm_tdx))) > return tdx_mem_page_aug(kvm, gfn, level, pfn); > > - /* TODO: tdh_mem_page_add() comes here for the initial memory. */ > - > - return 0; > + return tdx_mem_page_add(kvm, gfn, level, pfn); > } > > static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, > @@ -1322,6 +1383,96 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) > tdx_track(vcpu->kvm); > } > > +#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_GUEST_ENC_MASK) > + > +static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd) > +{ > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > + struct kvm_tdx_init_mem_region region; > + struct kvm_vcpu *vcpu; > + struct page *page; > + int idx, ret = 0; > + bool added = false; > + > + /* Once TD is finalized, the initial guest memory is fixed. */ > + if (is_td_finalized(kvm_tdx)) > + return -EINVAL; > + > + /* The BSP vCPU must be created before initializing memory regions. 
*/ > + if (!atomic_read(&kvm->online_vcpus)) > + return -EINVAL; > + > + if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION) > + return -EINVAL; > + > + if (copy_from_user(®ion, (void __user *)cmd->data, sizeof(region))) > + return -EFAULT; > + > + /* Sanity check */ > + if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) || > + !IS_ALIGNED(region.gpa, PAGE_SIZE) || > + !region.nr_pages || > + region.nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) || > + region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa || > + !kvm_is_private_gpa(kvm, region.gpa) || > + !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT))) > + return -EINVAL; > + > + vcpu = kvm_get_vcpu(kvm, 0); > + if (mutex_lock_killable(&vcpu->mutex)) > + return -EINTR; > + > + vcpu_load(vcpu); > + idx = srcu_read_lock(&kvm->srcu); > + > + kvm_mmu_reload(vcpu); > + > + while (region.nr_pages) { > + if (signal_pending(current)) { > + ret = -ERESTARTSYS; > + break; > + } > + > + if (need_resched()) > + cond_resched(); > + > + /* Pin the source page. */ > + ret = get_user_pages_fast(region.source_addr, 1, 0, &page); > + if (ret < 0) > + break; > + if (ret != 1) { > + ret = -ENOMEM; > + break; > + } > + > + kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) | > + (cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION); > + > + ret = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR, > + PG_LEVEL_4K); > + put_page(page); > + if (ret) > + break; > + > + region.source_addr += PAGE_SIZE; > + region.gpa += PAGE_SIZE; > + region.nr_pages--; > + added = true; > + } > + > + srcu_read_unlock(&kvm->srcu, idx); > + vcpu_put(vcpu); > + > + mutex_unlock(&vcpu->mutex); > + > + if (added && region.nr_pages > 0) > + ret = -EAGAIN; > + if (copy_to_user((void __user *)cmd->data, ®ion, sizeof(region))) > + ret = -EFAULT; > + > + return ret; > +} > + > int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) > { > struct kvm_tdx_cmd tdx_cmd; > @@ -1341,6 +1492,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) > case KVM_TDX_INIT_VM: > r = tdx_td_init(kvm, &tdx_cmd); > break; > + case KVM_TDX_INIT_MEM_REGION: > + r = tdx_init_mem_region(kvm, &tdx_cmd); > + break; > default: > r = -EINVAL; > goto out; > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h > index 783ce329d7da..d589a2caedfb 100644 > --- a/arch/x86/kvm/vmx/tdx.h > +++ b/arch/x86/kvm/vmx/tdx.h > @@ -17,6 +17,8 @@ struct kvm_tdx { > u64 xfam; > int hkid; > > + hpa_t source_pa; > + > bool finalized; > atomic_t tdh_mem_track; >
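For completeness, a sketch of how a VMM might drive this sub-command from userspace. It assumes the struct kvm_tdx_cmd layout (id/flags/data) defined earlier in the series, a vm_fd whose TD has gone through KVM_TDX_INIT_VM and vCPU creation but not yet finalization, and placeholder variables src_buf, guest_pa and size. On partial progress the ioctl updates 'region' and fails with EAGAIN, so the loop simply retries the remainder:

	struct kvm_tdx_init_mem_region region = {
		.source_addr = (__u64)(unsigned long)src_buf,	/* page-aligned initial image */
		.gpa         = guest_pa,			/* page-aligned private GPA */
		.nr_pages    = size / 4096,
	};
	struct kvm_tdx_cmd cmd = {
		.id    = KVM_TDX_INIT_MEM_REGION,
		.flags = KVM_TDX_MEASURE_MEMORY_REGION,	/* also extend the TD measurement */
		.data  = (__u64)(unsigned long)&region,
	};
	int r;

	do {
		r = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
	} while (r < 0 && errno == EAGAIN);	/* 'region' now describes the remainder */

	if (r < 0)
		err(1, "KVM_TDX_INIT_MEM_REGION");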
On Thu, Feb 01, 2024 at 03:06:46PM -0800, David Matlack <dmatlack@google.com> wrote: > +Vipin Sharma > > On Wed, Jan 31, 2024 at 4:21 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Jan 22, 2024, isaku.yamahata@intel.com wrote: > > > > The real reason for this drive-by pseudo-review is that I am hoping/wishing we > > can turn this into a generic KVM ioctl() to allow userspace to pre-map guest > > memory[*]. > > > > If we're going to carry non-trivial code, we might as well squeeze as much use > > out of it as we can. > > > > Beyond wanting to shove this into KVM_MEMORY_ENCRYPT_OP, is there any reason why > > this is a VM ioctl() and not a vCPU ioctl()? Very roughly, couldn't we use a > > struct like this as input to a vCPU ioctl() that maps memory, and optionally > > initializes memory from @source? > > > > struct kvm_memory_mapping { > > __u64 base_gfn; > > __u64 nr_pages; > > __u64 flags; > > __u64 source; > > } > > > > TDX would need to do special things for copying the source, but beyond that most > > of the code in this function is generic. > > > > [*] https://lore.kernel.org/all/65262e67-7885-971a-896d-ad9c0a760907@polito.it > > We would also be interested in such an API to reduce the guest > performance impact of intra-host migration. I introduce KVM_MEMORY_MAPPING and KVM_CAP_MEMORY_MAPPING with v19. We can continue the discussion there.
On Mon, Feb 26, 2024, Isaku Yamahata wrote: > On Thu, Feb 01, 2024 at 03:06:46PM -0800, > David Matlack <dmatlack@google.com> wrote: > > > +Vipin Sharma > > > > On Wed, Jan 31, 2024 at 4:21 PM Sean Christopherson <seanjc@google.com> wrote: > > > On Mon, Jan 22, 2024, isaku.yamahata@intel.com wrote: > > > > > > The real reason for this drive-by pseudo-review is that I am hoping/wishing we > > > can turn this into a generic KVM ioctl() to allow userspace to pre-map guest > > > memory[*]. > > > > > > If we're going to carry non-trivial code, we might as well squeeze as much use > > > out of it as we can. > > > > > > Beyond wanting to shove this into KVM_MEMORY_ENCRYPT_OP, is there any reason why > > > this is a VM ioctl() and not a vCPU ioctl()? Very roughly, couldn't we use a > > > struct like this as input to a vCPU ioctl() that maps memory, and optionally > > > initializes memory from @source? > > > > > > struct kvm_memory_mapping { > > > __u64 base_gfn; > > > __u64 nr_pages; > > > __u64 flags; > > > __u64 source; > > > } > > > > > > TDX would need to do special things for copying the source, but beyond that most > > > of the code in this function is generic. > > > > > > [*] https://lore.kernel.org/all/65262e67-7885-971a-896d-ad9c0a760907@polito.it > > > > We would also be interested in such an API to reduce the guest > > performance impact of intra-host migration. > > I introduce KVM_MEMORY_MAPPING and KVM_CAP_MEMORY_MAPPING with v19. KVM_MEMORY_MAPPING is not a good ioctl() name. There needs to be a verb in there somewhere, e.g. KVM_MAP_MEMORY, KVM_FAULTIN_MEMORY, etc. > We can continue the discussion there. No, we absolutely cannot continue the conversation there. That is not how kernel development works. Enough is enough. I am archiving v19 and not touching it. Please post an RFC for _just_ this functionality, and follow-up in existing, pre-v19 conversations for anything else that changed between v18 and v19 and might need additional input/discussion.
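Spelled out with a verb in the name, the VMM-side usage might look like the snippet below. Everything here is a placeholder at this point in the discussion: KVM_MAP_MEMORY, the flag values, and whether 'source' is honored for non-TDX VMs are all open questions for the RFC:

	/* Pre-fault (and optionally populate) a 4KiB-page range via a vCPU ioctl. */
	struct kvm_memory_mapping map = {
		.base_gfn = gpa >> 12,
		.nr_pages = bytes >> 12,
		.flags    = 0,					/* e.g. a TDX "measure" flag */
		.source   = (__u64)(unsigned long)initial_image,/* 0 == just map, no copy */
	};

	if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &map))
		err(1, "KVM_MAP_MEMORY");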
On Mon, Feb 26, 2024 at 10:50:55AM -0800, Sean Christopherson <seanjc@google.com> wrote:
> Please post an RFC for _just_ this functionality, and follow-up in existing,
> pre-v19 conversations for anything else that changed between v18 and v19 and might
> need additional input/discussion.

Sure, will post it. My plan is as follows for input/discussion:
- Review SEV-SNP patches by Paolo for commonality
- RFC patch to KVM_MAP_MEMORY or KVM_FAULTIN_MEMORY
- RFC patch for uKVM for confidential VM
On Tue, Feb 27, 2024, Isaku Yamahata wrote:
> On Mon, Feb 26, 2024 at 10:50:55AM -0800,
> Sean Christopherson <seanjc@google.com> wrote:
>
> > Please post an RFC for _just_ this functionality, and follow-up in existing,
> > pre-v19 conversations for anything else that changed between v18 and v19 and might
> > need additional input/discussion.
>
> Sure, will post it. My plan is as follows for input/discussion:
> - Review SEV-SNP patches by Paolo for commonality
> - RFC patch to KVM_MAP_MEMORY or KVM_FAULTIN_MEMORY
> - RFC patch for uKVM for confidential VM

uKVM?
On Tue, Feb 27, 2024 at 09:30:11AM -0800, Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Feb 27, 2024, Isaku Yamahata wrote:
> > On Mon, Feb 26, 2024 at 10:50:55AM -0800,
> > Sean Christopherson <seanjc@google.com> wrote:
> >
> > > Please post an RFC for _just_ this functionality, and follow-up in existing,
> > > pre-v19 conversations for anything else that changed between v18 and v19 and might
> > > need additional input/discussion.
> >
> > Sure, will post it. My plan is as follows for input/discussion:
> > - Review SEV-SNP patches by Paolo for commonality
> > - RFC patch to KVM_MAP_MEMORY or KVM_FAULTIN_MEMORY
> > - RFC patch for uKVM for confidential VM
>
> uKVM?

I meant uAPI, sorry for the typo. Although I looked into a unified uAPI with SEV, the gain seems to be small or nonexistent. I'm currently planning to drop it based on the feedback at https://lore.kernel.org/kvm/ZL%2Fr6Vca8WkFVaic@google.com/
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 4000a2e087a8..9fda7c90b7b5 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -572,6 +572,7 @@ enum kvm_tdx_cmd_id { KVM_TDX_CAPABILITIES = 0, KVM_TDX_INIT_VM, KVM_TDX_INIT_VCPU, + KVM_TDX_INIT_MEM_REGION, KVM_TDX_CMD_NR_MAX, }; @@ -649,4 +650,12 @@ struct kvm_tdx_init_vm { struct kvm_cpuid2 cpuid; }; +#define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0) + +struct kvm_tdx_init_mem_region { + __u64 source_addr; + __u64 gpa; + __u64 nr_pages; +}; + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 26d215e85b76..fc258f112e73 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5663,6 +5663,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) out: return r; } +EXPORT_SYMBOL(kvm_mmu_load); void kvm_mmu_unload(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index 4cbcedff4f16..1a5a91b99de9 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -591,6 +591,69 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn, return 0; } +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, + enum pg_level level, kvm_pfn_t pfn) +{ + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); + hpa_t hpa = pfn_to_hpa(pfn); + gpa_t gpa = gfn_to_gpa(gfn); + struct tdx_module_args out; + hpa_t source_pa; + bool measure; + u64 err; + int i; + + /* + * KVM_INIT_MEM_REGION, tdx_init_mem_region(), supports only 4K page + * because tdh_mem_page_add() supports only 4K page. + */ + if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) + return -EINVAL; + + /* + * In case of TDP MMU, fault handler can run concurrently. Note + * 'source_pa' is a TD scope variable, meaning if there are multiple + * threads reaching here with all needing to access 'source_pa', it + * will break. However fortunately this won't happen, because below + * TDH_MEM_PAGE_ADD code path is only used when VM is being created + * before it is running, using KVM_TDX_INIT_MEM_REGION ioctl (which + * always uses vcpu 0's page table and protected by vcpu->mutex). + */ + if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) { + tdx_unpin(kvm, pfn); + return -EINVAL; + } + + source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION; + measure = kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION; + kvm_tdx->source_pa = INVALID_PAGE; + + do { + err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa, + &out); + /* + * This path is executed during populating initial guest memory + * image. i.e. before running any vcpu. Race is rare. + */ + } while (unlikely(err == TDX_ERROR_SEPT_BUSY)); + if (KVM_BUG_ON(err, kvm)) { + pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out); + tdx_unpin(kvm, pfn); + return -EIO; + } else if (measure) { + for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) { + err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out); + if (KVM_BUG_ON(err, &kvm_tdx->kvm)) { + pr_tdx_error(TDH_MR_EXTEND, err, &out); + break; + } + } + } + + return 0; + +} + static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn) { @@ -613,9 +676,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, if (likely(is_td_finalized(kvm_tdx))) return tdx_mem_page_aug(kvm, gfn, level, pfn); - /* TODO: tdh_mem_page_add() comes here for the initial memory. 
*/ - - return 0; + return tdx_mem_page_add(kvm, gfn, level, pfn); } static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn, @@ -1322,6 +1383,96 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) tdx_track(vcpu->kvm); } +#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_GUEST_ENC_MASK) + +static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd) +{ + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); + struct kvm_tdx_init_mem_region region; + struct kvm_vcpu *vcpu; + struct page *page; + int idx, ret = 0; + bool added = false; + + /* Once TD is finalized, the initial guest memory is fixed. */ + if (is_td_finalized(kvm_tdx)) + return -EINVAL; + + /* The BSP vCPU must be created before initializing memory regions. */ + if (!atomic_read(&kvm->online_vcpus)) + return -EINVAL; + + if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION) + return -EINVAL; + + if (copy_from_user(®ion, (void __user *)cmd->data, sizeof(region))) + return -EFAULT; + + /* Sanity check */ + if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) || + !IS_ALIGNED(region.gpa, PAGE_SIZE) || + !region.nr_pages || + region.nr_pages & GENMASK_ULL(63, 63 - PAGE_SHIFT) || + region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa || + !kvm_is_private_gpa(kvm, region.gpa) || + !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT))) + return -EINVAL; + + vcpu = kvm_get_vcpu(kvm, 0); + if (mutex_lock_killable(&vcpu->mutex)) + return -EINTR; + + vcpu_load(vcpu); + idx = srcu_read_lock(&kvm->srcu); + + kvm_mmu_reload(vcpu); + + while (region.nr_pages) { + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + + if (need_resched()) + cond_resched(); + + /* Pin the source page. */ + ret = get_user_pages_fast(region.source_addr, 1, 0, &page); + if (ret < 0) + break; + if (ret != 1) { + ret = -ENOMEM; + break; + } + + kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) | + (cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION); + + ret = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR, + PG_LEVEL_4K); + put_page(page); + if (ret) + break; + + region.source_addr += PAGE_SIZE; + region.gpa += PAGE_SIZE; + region.nr_pages--; + added = true; + } + + srcu_read_unlock(&kvm->srcu, idx); + vcpu_put(vcpu); + + mutex_unlock(&vcpu->mutex); + + if (added && region.nr_pages > 0) + ret = -EAGAIN; + if (copy_to_user((void __user *)cmd->data, ®ion, sizeof(region))) + ret = -EFAULT; + + return ret; +} + int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { struct kvm_tdx_cmd tdx_cmd; @@ -1341,6 +1492,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) case KVM_TDX_INIT_VM: r = tdx_td_init(kvm, &tdx_cmd); break; + case KVM_TDX_INIT_MEM_REGION: + r = tdx_init_mem_region(kvm, &tdx_cmd); + break; default: r = -EINVAL; goto out; diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h index 783ce329d7da..d589a2caedfb 100644 --- a/arch/x86/kvm/vmx/tdx.h +++ b/arch/x86/kvm/vmx/tdx.h @@ -17,6 +17,8 @@ struct kvm_tdx { u64 xfam; int hkid; + hpa_t source_pa; + bool finalized; atomic_t tdh_mem_track;
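One note on the sanity check in tdx_init_mem_region() above: the GENMASK_ULL(63, 63 - PAGE_SHIFT) test is what keeps the later nr_pages << PAGE_SHIFT arithmetic from wrapping. Worked through for 4K pages (PAGE_SHIFT == 12), purely as an illustration:

	/*
	 * GENMASK_ULL(63, 51) covers bits 51..63 of nr_pages, so passing the
	 * check means:
	 *
	 *   nr_pages       < 1ULL << 51   (at most ~2^51 pages, i.e. under 8 EiB)
	 *   nr_pages << 12 < 1ULL << 63   (cannot overflow a u64, top bit clear)
	 *
	 * Any remaining wrap of region.gpa + (nr_pages << 12) is then caught by
	 * the explicit "<= region.gpa" comparison.
	 */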