Message ID: 20241201035358.2193078-5-binbin.wu@linux.intel.com (mailing list archive)
State: New, archived
Series: KVM: TDX: TDX hypercalls may exit to userspace
>+/*
>+ * Split into chunks and check interrupt pending between chunks. This allows
>+ * for timely injection of interrupts to prevent issues with guest lockup
>+ * detection.

Would it cause any problems if an (intra-host or inter-host) migration happens
between chunks?

My understanding is that KVM would lose track of the progress if
map_gpa_next/end are not migrated. I'm not sure if KVM should expose the
state or prevent migration in the middle. Or, we can let the userspace VMM
cut the range into chunks, making it the userspace VMM's responsibility to
ensure necessary state is migrated.

I am not asking to fix this issue right now. I just want to ensure this issue
can be solved in a clean way when we start to support migration.

>+ */
>+#define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024)
>+static void __tdx_map_gpa(struct vcpu_tdx * tdx);
>+
>+static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>+{
>+        struct vcpu_tdx * tdx = to_tdx(vcpu);
>+
>+        if(vcpu->run->hypercall.ret) {
>+                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
>+                kvm_r11_write(vcpu, tdx->map_gpa_next);

s/kvm_r11_write/tdvmcall_set_return_val

please fix other call sites in this patch.

>+                return 1;
>+        }
>+
>+        tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
>+        if (tdx->map_gpa_next >= tdx->map_gpa_end) {
>+                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
>+                return 1;
>+        }
>+
>+        /*
>+         * Stop processing the remaining part if there is pending interrupt.
>+         * Skip checking pending virtual interrupt (reflected by
>+         * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
>+         * if guest disabled interrupt, it's OK not returning back to guest
>+         * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
>+         * immediately after STI or MOV/POP SS.
>+         */
>+        if (pi_has_pending_interrupt(vcpu) ||
>+            kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
>+                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
>+                kvm_r11_write(vcpu, tdx->map_gpa_next);
>+                return 1;
>+        }
>+
>+        __tdx_map_gpa(tdx);
>+        /* Forward request to userspace. */
>+        return 0;
>+}
>+
>+static void __tdx_map_gpa(struct vcpu_tdx * tdx)
>+{
>+        u64 gpa = tdx->map_gpa_next;
>+        u64 size = tdx->map_gpa_end - tdx->map_gpa_next;
>+
>+        if(size > TDX_MAP_GPA_MAX_LEN)
>+                size = TDX_MAP_GPA_MAX_LEN;
>+
>+        tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
>+        tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
>+        tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
>+        tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
>+        tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
>+                                           KVM_MAP_GPA_RANGE_ENCRYPTED :
>+                                           KVM_MAP_GPA_RANGE_DECRYPTED;
>+        tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
>+
>+        tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;
>+}
>+
>+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>+{
>+        struct vcpu_tdx * tdx = to_tdx(vcpu);
>+        u64 gpa = tdvmcall_a0_read(vcpu);
>+        u64 size = tdvmcall_a1_read(vcpu);
>+        u64 ret;
>+
>+        /*
>+         * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
>+         * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
>+         * bit set. If not, the error code is not defined in GHCI for TDX, use
>+         * TDVMCALL_STATUS_INVALID_OPERAND for this case.
>+         */
>+        if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
>+                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>+                goto error;
>+        }
>+
>+        if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
>+            !kvm_vcpu_is_legal_gpa(vcpu, gpa + size -1) ||
>+            (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
>+             vt_is_tdx_private_gpa(vcpu->kvm, gpa + size -1))) {
>+                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>+                goto error;
>+        }
>+
>+        if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
>+                ret = TDVMCALL_STATUS_ALIGN_ERROR;
>+                goto error;
>+        }
>+
>+        tdx->map_gpa_end = gpa + size;
>+        tdx->map_gpa_next = gpa;
>+
>+        __tdx_map_gpa(tdx);
>+        /* Forward request to userspace. */
>+        return 0;
>+
>+error:
>+        tdvmcall_set_return_code(vcpu, ret);
>+        kvm_r11_write(vcpu, gpa);
>+        return 1;
>+}
>+
> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> {
>         if (tdvmcall_exit_type(vcpu))
>                 return tdx_emulate_vmcall(vcpu);
>
>         switch (tdvmcall_leaf(vcpu)) {
>+        case TDVMCALL_MAP_GPA:
>+                return tdx_map_gpa(vcpu);
>         default:
>                 break;
>         }
>diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>index 1abc94b046a0..bfae70887695 100644
>--- a/arch/x86/kvm/vmx/tdx.h
>+++ b/arch/x86/kvm/vmx/tdx.h
>@@ -71,6 +71,9 @@ struct vcpu_tdx {
>
>         enum tdx_prepare_switch_state prep_switch_state;
>         u64 msr_host_kernel_gs_base;
>+
>+        u64 map_gpa_next;
>+        u64 map_gpa_end;
> };
>
> void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
>--
>2.46.0
>
On 12/9/2024 8:45 PM, Chao Gao wrote:
>> +/*
>> + * Split into chunks and check interrupt pending between chunks. This allows
>> + * for timely injection of interrupts to prevent issues with guest lockup
>> + * detection.
> Would it cause any problems if an (intra-host or inter-host) migration happens
> between chunks?
>
> My understanding is that KVM would lose track of the progress if
> map_gpa_next/end are not migrated. I'm not sure if KVM should expose the
> state or prevent migration in the middle. Or, we can let the userspace VMM
> cut the range into chunks, making it the userspace VMM's responsibility to
> ensure necessary state is migrated.
>
> I am not asking to fix this issue right now. I just want to ensure this issue
> can be solved in a clean way when we start to support migration.

How about:
Before exiting to userspace, KVM always sets the start GPA in R11 and sets the
return code to TDVMCALL_STATUS_RETRY.
- If userspace finishes the part, the complete_userspace_io() callback will
  be called and the return code will be set to TDVMCALL_STATUS_SUCCESS.
- If live migration interrupts the MapGPA in userspace, and
  complete_userspace_io() is not called, when the vCPU resumes from migration,
  the TDX guest will see the return code TDVMCALL_STATUS_RETRY with the failed
  GPA, and it can retry the MapGPA with that GPA.

>
>> + */
>> +#define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024)
>> +static void __tdx_map_gpa(struct vcpu_tdx * tdx);
>> +
>> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
>> +{
>> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
>> +
>> +        if(vcpu->run->hypercall.ret) {
>> +                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
>> +                kvm_r11_write(vcpu, tdx->map_gpa_next);
> s/kvm_r11_write/tdvmcall_set_return_val
>
> please fix other call sites in this patch.

I didn't use tdvmcall_set_return_val() because it's used for TDVMCALLs like
MMIO, PIO, etc., which have a return value. For MapGPA, the GPA is part of the
error information returned to the TDX guest. So, I used the raw register name
version.
Let's see if others also think tdvmcall_set_return_val() is clearer for this
case.

>
>> +                return 1;
>> +        }
>> +
>> +        tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
>> +        if (tdx->map_gpa_next >= tdx->map_gpa_end) {
>> +                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
>> +                return 1;
>> +        }
>> +
>> +        /*
>> +         * Stop processing the remaining part if there is pending interrupt.
>> +         * Skip checking pending virtual interrupt (reflected by
>> +         * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
>> +         * if guest disabled interrupt, it's OK not returning back to guest
>> +         * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
>> +         * immediately after STI or MOV/POP SS.
>> +         */
>> +        if (pi_has_pending_interrupt(vcpu) ||
>> +            kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
>> +                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
>> +                kvm_r11_write(vcpu, tdx->map_gpa_next);
>> +                return 1;
>> +        }
>> +
>> +        __tdx_map_gpa(tdx);
>> +        /* Forward request to userspace. */
>> +        return 0;
>> +}
>> +
>> +static void __tdx_map_gpa(struct vcpu_tdx * tdx)
>> +{
>> +        u64 gpa = tdx->map_gpa_next;
>> +        u64 size = tdx->map_gpa_end - tdx->map_gpa_next;
>> +
>> +        if(size > TDX_MAP_GPA_MAX_LEN)
>> +                size = TDX_MAP_GPA_MAX_LEN;
>> +
>> +        tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
>> +        tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
>> +        tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
>> +        tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
>> +        tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
>> +                                           KVM_MAP_GPA_RANGE_ENCRYPTED :
>> +                                           KVM_MAP_GPA_RANGE_DECRYPTED;
>> +        tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
>> +
>> +        tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;
>> +}
>> +
>> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>> +{
>> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
>> +        u64 gpa = tdvmcall_a0_read(vcpu);
>> +        u64 size = tdvmcall_a1_read(vcpu);
>> +        u64 ret;
>> +
>> +        /*
>> +         * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
>> +         * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
>> +         * bit set. If not, the error code is not defined in GHCI for TDX, use
>> +         * TDVMCALL_STATUS_INVALID_OPERAND for this case.
>> +         */
>> +        if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
>> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>> +                goto error;
>> +        }
>> +
>> +        if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
>> +            !kvm_vcpu_is_legal_gpa(vcpu, gpa + size -1) ||
>> +            (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
>> +             vt_is_tdx_private_gpa(vcpu->kvm, gpa + size -1))) {
>> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>> +                goto error;
>> +        }
>> +
>> +        if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
>> +                ret = TDVMCALL_STATUS_ALIGN_ERROR;
>> +                goto error;
>> +        }
>> +
>> +        tdx->map_gpa_end = gpa + size;
>> +        tdx->map_gpa_next = gpa;
>> +
>> +        __tdx_map_gpa(tdx);
>> +        /* Forward request to userspace. */
>> +        return 0;
>> +
>> +error:
>> +        tdvmcall_set_return_code(vcpu, ret);
>> +        kvm_r11_write(vcpu, gpa);
>> +        return 1;
>> +}
>> +
>>  static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>>  {
>>          if (tdvmcall_exit_type(vcpu))
>>                  return tdx_emulate_vmcall(vcpu);
>>
>>          switch (tdvmcall_leaf(vcpu)) {
>> +        case TDVMCALL_MAP_GPA:
>> +                return tdx_map_gpa(vcpu);
>>          default:
>>                  break;
>>          }
>> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>> index 1abc94b046a0..bfae70887695 100644
>> --- a/arch/x86/kvm/vmx/tdx.h
>> +++ b/arch/x86/kvm/vmx/tdx.h
>> @@ -71,6 +71,9 @@ struct vcpu_tdx {
>>
>>          enum tdx_prepare_switch_state prep_switch_state;
>>          u64 msr_host_kernel_gs_base;
>> +
>> +        u64 map_gpa_next;
>> +        u64 map_gpa_end;
>>  };
>>
>>  void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
>> --
>> 2.46.0
>>
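For reference, the two spellings under discussion differ only in intent, not effect, assuming tdvmcall_set_return_val() is this series' wrapper that writes R11 (an assumption; check patch 022 in the series):

```c
/* As posted: R11 carries the failed GPA as raw error information. */
kvm_r11_write(vcpu, tdx->map_gpa_next);

/*
 * Chao's suggestion: treat the GPA as the TDVMCALL return value,
 * assuming tdvmcall_set_return_val() is the series' R11 accessor.
 */
tdvmcall_set_return_val(vcpu, tdx->map_gpa_next);
```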
On Tue, Dec 10, 2024 at 10:51:56AM +0800, Binbin Wu wrote:
>
>
>On 12/9/2024 8:45 PM, Chao Gao wrote:
>> > +/*
>> > + * Split into chunks and check interrupt pending between chunks. This allows
>> > + * for timely injection of interrupts to prevent issues with guest lockup
>> > + * detection.
>> Would it cause any problems if an (intra-host or inter-host) migration happens
>> between chunks?
>>
>> My understanding is that KVM would lose track of the progress if
>> map_gpa_next/end are not migrated. I'm not sure if KVM should expose the
>> state or prevent migration in the middle. Or, we can let the userspace VMM
>> cut the range into chunks, making it the userspace VMM's responsibility to
>> ensure necessary state is migrated.
>>
>> I am not asking to fix this issue right now. I just want to ensure this issue
>> can be solved in a clean way when we start to support migration.
>How about:
>Before exiting to userspace, KVM always sets the start GPA in R11 and sets the
>return code to TDVMCALL_STATUS_RETRY.
>- If userspace finishes the part, the complete_userspace_io() callback will
> be called and the return code will be set to TDVMCALL_STATUS_SUCCESS.
>- If live migration interrupts the MapGPA in userspace, and
> complete_userspace_io() is not called, when the vCPU resumes from migration,
> the TDX guest will see the return code TDVMCALL_STATUS_RETRY with the failed
> GPA, and it can retry the MapGPA with that GPA.

Sounds good
On Tue, Dec 10, 2024 at 05:10:29PM +0800, Chao Gao wrote:
> On Tue, Dec 10, 2024 at 10:51:56AM +0800, Binbin Wu wrote:
> >On 12/9/2024 8:45 PM, Chao Gao wrote:
> >> > +/*
> >> > + * Split into chunks and check interrupt pending between chunks. This allows
> >> > + * for timely injection of interrupts to prevent issues with guest lockup
> >> > + * detection.
> >> Would it cause any problems if an (intra-host or inter-host) migration happens
> >> between chunks?
> >>
> >> My understanding is that KVM would lose track of the progress if
> >> map_gpa_next/end are not migrated. I'm not sure if KVM should expose the
> >> state or prevent migration in the middle. Or, we can let the userspace VMM
> >> cut the range into chunks, making it the userspace VMM's responsibility to
> >> ensure necessary state is migrated.
> >>
> >> I am not asking to fix this issue right now. I just want to ensure this issue
> >> can be solved in a clean way when we start to support migration.
> >How about:
> >Before exiting to userspace, KVM always sets the start GPA in R11 and sets the
> >return code to TDVMCALL_STATUS_RETRY.
> >- If userspace finishes the part, the complete_userspace_io() callback will
> > be called and the return code will be set to TDVMCALL_STATUS_SUCCESS.
> >- If live migration interrupts the MapGPA in userspace, and
> > complete_userspace_io() is not called, when the vCPU resumes from migration,
> > the TDX guest will see the return code TDVMCALL_STATUS_RETRY with the failed
> > GPA, and it can retry the MapGPA with that GPA.
>
> Sounds good

Makes sense to me too.

Regards,
Tony
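To make the agreed retry semantics concrete, a minimal sketch of the guest-side loop this scheme assumes; this is illustrative, not the actual Linux TDX guest code, and tdvmcall_map_gpa() is a hypothetical helper that issues TDG.VP.VMCALL<MapGPA> for [gpa, gpa + len) and returns the R11 resume GPA via @resume_gpa:

```c
/*
 * Sketch: retry MapGPA from the GPA KVM reported in R11 whenever the
 * status code is TDVMCALL_STATUS_RETRY.  This also covers the case
 * where live migration interrupted the request in userspace, since the
 * resumed vCPU simply sees RETRY with the not-yet-converted GPA.
 */
static int map_gpa_with_retry(u64 gpa, u64 len)
{
        u64 end = gpa + len;
        u64 resume_gpa, ret;

        do {
                ret = tdvmcall_map_gpa(gpa, end - gpa, &resume_gpa);
                if (ret == TDVMCALL_STATUS_RETRY)
                        gpa = resume_gpa; /* continue after the mapped part */
        } while (ret == TDVMCALL_STATUS_RETRY);

        return ret == TDVMCALL_STATUS_SUCCESS ? 0 : -EIO;
}
```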
On 12/1/2024 11:53 AM, Binbin Wu wrote:
> Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
> KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.
>
> MapGPA is used by a TDX guest to request mapping a GPA range as private
> or shared memory. It needs to exit to userspace for handling. KVM has
> already implemented a similar hypercall, KVM_HC_MAP_GPA_RANGE, which will
> exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks,
> convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request
> to userspace.
>
> To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
> range is split into 2MB chunks, and pending interrupts are checked between
> chunks. This allows for timely injection of interrupts and prevents issues
> with guest lockup detection. The TDX guest should retry the operation for
> the GPA starting at the address specified in R11 when the TDVMCALL returns
> TDVMCALL_STATUS_RETRY as the status code.
>
> Note that userspace needs to enable KVM_CAP_EXIT_HYPERCALL with the
> KVM_HC_MAP_GPA_RANGE bit set for a TD VM.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> Hypercalls exit to userspace breakout:
> - New added.
>   Implement one of the hypercalls that need to exit to userspace for
>   handling, after dropping "KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL",
>   which tries to resolve Sean's comment.
>   https://lore.kernel.org/kvm/Zg18ul8Q4PGQMWam@google.com/
> - Check interrupt pending between chunks, as suggested by Sean.
>   https://lore.kernel.org/kvm/ZleJvmCawKqmpFIa@google.com/
> - Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
> - Use vt_is_tdx_private_gpa()
> ---
>  arch/x86/include/asm/shared/tdx.h |   1 +
>  arch/x86/kvm/vmx/tdx.c            | 110 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h            |   3 +
>  3 files changed, 114 insertions(+)
>
> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> index 620327f0161f..a602d081cf1c 100644
> --- a/arch/x86/include/asm/shared/tdx.h
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -32,6 +32,7 @@
>  #define TDVMCALL_STATUS_SUCCESS         0x0000000000000000ULL
>  #define TDVMCALL_STATUS_RETRY           0x0000000000000001ULL
>  #define TDVMCALL_STATUS_INVALID_OPERAND 0x8000000000000000ULL
> +#define TDVMCALL_STATUS_ALIGN_ERROR     0x8000000000000002ULL
>
>  /*
>   * Bitmasks of exposed registers (with VMM).
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 4cc55b120ab0..553f4cbe0693 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -985,12 +985,122 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
>          return r > 0;
>  }
>
> +/*
> + * Split into chunks and check interrupt pending between chunks. This allows
> + * for timely injection of interrupts to prevent issues with guest lockup
> + * detection.
> + */
> +#define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024)
> +static void __tdx_map_gpa(struct vcpu_tdx * tdx);
> +
> +static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
> +{
> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
> +
> +        if(vcpu->run->hypercall.ret) {
> +                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
> +                kvm_r11_write(vcpu, tdx->map_gpa_next);
> +                return 1;
> +        }
> +
> +        tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
> +        if (tdx->map_gpa_next >= tdx->map_gpa_end) {
> +                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
> +                return 1;
> +        }
> +
> +        /*
> +         * Stop processing the remaining part if there is pending interrupt.
> +         * Skip checking pending virtual interrupt (reflected by
> +         * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
> +         * if guest disabled interrupt, it's OK not returning back to guest
> +         * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
> +         * immediately after STI or MOV/POP SS.
> +         */
> +        if (pi_has_pending_interrupt(vcpu) ||
> +            kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
> +                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
> +                kvm_r11_write(vcpu, tdx->map_gpa_next);
> +                return 1;
> +        }
> +
> +        __tdx_map_gpa(tdx);
> +        /* Forward request to userspace. */
> +        return 0;
> +}
> +
> +static void __tdx_map_gpa(struct vcpu_tdx * tdx)
> +{
> +        u64 gpa = tdx->map_gpa_next;
> +        u64 size = tdx->map_gpa_end - tdx->map_gpa_next;
> +
> +        if(size > TDX_MAP_GPA_MAX_LEN)
> +                size = TDX_MAP_GPA_MAX_LEN;
> +
> +        tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
> +        tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
> +        tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
> +        tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
> +        tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
> +                                           KVM_MAP_GPA_RANGE_ENCRYPTED :
> +                                           KVM_MAP_GPA_RANGE_DECRYPTED;
> +        tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
> +
> +        tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;
> +}
> +
> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
> +{
> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
> +        u64 gpa = tdvmcall_a0_read(vcpu);

We can use kvm_r12_read() directly, which is more intuitive. And we can drop
the MACRO for a0/a1/a2/a3 accessors in patch 022.

> +        u64 size = tdvmcall_a1_read(vcpu);
> +        u64 ret;
> +
> +        /*
> +         * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
> +         * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
> +         * bit set. If not, the error code is not defined in GHCI for TDX, use
> +         * TDVMCALL_STATUS_INVALID_OPERAND for this case.
> +         */
> +        if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
> +                goto error;
> +        }
> +
> +        if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
> +            !kvm_vcpu_is_legal_gpa(vcpu, gpa + size -1) ||
> +            (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
> +             vt_is_tdx_private_gpa(vcpu->kvm, gpa + size -1))) {
> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
> +                goto error;
> +        }
> +
> +        if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
> +                ret = TDVMCALL_STATUS_ALIGN_ERROR;
> +                goto error;
> +        }
> +
> +        tdx->map_gpa_end = gpa + size;
> +        tdx->map_gpa_next = gpa;
> +
> +        __tdx_map_gpa(tdx);
> +        /* Forward request to userspace. */
> +        return 0;

Maybe let __tdx_map_gpa() return an int to decide whether it needs to return
to userspace and

        return __tdx_map_gpa(tdx);

?

> +
> +error:
> +        tdvmcall_set_return_code(vcpu, ret);
> +        kvm_r11_write(vcpu, gpa);
> +        return 1;
> +}
> +
>  static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>  {
>          if (tdvmcall_exit_type(vcpu))
>                  return tdx_emulate_vmcall(vcpu);
>
>          switch (tdvmcall_leaf(vcpu)) {
> +        case TDVMCALL_MAP_GPA:
> +                return tdx_map_gpa(vcpu);
>          default:
>                  break;
>          }
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 1abc94b046a0..bfae70887695 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -71,6 +71,9 @@ struct vcpu_tdx {
>
>          enum tdx_prepare_switch_state prep_switch_state;
>          u64 msr_host_kernel_gs_base;
> +
> +        u64 map_gpa_next;
> +        u64 map_gpa_end;
>  };
>
>  void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
On 12/13/2024 5:32 PM, Xiaoyao Li wrote:
> On 12/1/2024 11:53 AM, Binbin Wu wrote:
>
[...]
>> +
>> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>> +{
>> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
>> +        u64 gpa = tdvmcall_a0_read(vcpu);
>
> We can use kvm_r12_read() directly, which is more intuitive. And we can
> drop the MACRO for a0/a1/a2/a3 accessors in patch 022.

I am neutral about it.

>
>> +        u64 size = tdvmcall_a1_read(vcpu);
>> +        u64 ret;
>> +
>> +        /*
>> +         * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
>> +         * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
>> +         * bit set. If not, the error code is not defined in GHCI for TDX, use
>> +         * TDVMCALL_STATUS_INVALID_OPERAND for this case.
>> +         */
>> +        if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
>> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>> +                goto error;
>> +        }
>> +
>> +        if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
>> +            !kvm_vcpu_is_legal_gpa(vcpu, gpa + size -1) ||
>> +            (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
>> +             vt_is_tdx_private_gpa(vcpu->kvm, gpa + size -1))) {
>> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>> +                goto error;
>> +        }
>> +
>> +        if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
>> +                ret = TDVMCALL_STATUS_ALIGN_ERROR;
>> +                goto error;
>> +        }
>> +
>> +        tdx->map_gpa_end = gpa + size;
>> +        tdx->map_gpa_next = gpa;
>> +
>> +        __tdx_map_gpa(tdx);
>> +        /* Forward request to userspace. */
>> +        return 0;
>
> Maybe let __tdx_map_gpa() return an int to decide whether it needs to
> return to userspace and
>
>         return __tdx_map_gpa(tdx);
>
> ?

To save one line of code and the comment?
Because MapGPA always goes to userspace, I don't want to make a function
return an int that is a fixed value.
And if the multiple comments bother you, I think the comments can be
removed.

>
>
>> +
>> +error:
>> +        tdvmcall_set_return_code(vcpu, ret);
>> +        kvm_r11_write(vcpu, gpa);
>> +        return 1;
>> +}
>> +
>>  static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>>  {
>>          if (tdvmcall_exit_type(vcpu))
>>                  return tdx_emulate_vmcall(vcpu);
>>          switch (tdvmcall_leaf(vcpu)) {
>> +        case TDVMCALL_MAP_GPA:
>> +                return tdx_map_gpa(vcpu);
>>          default:
>>                  break;
>>          }
>> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>> index 1abc94b046a0..bfae70887695 100644
>> --- a/arch/x86/kvm/vmx/tdx.h
>> +++ b/arch/x86/kvm/vmx/tdx.h
>> @@ -71,6 +71,9 @@ struct vcpu_tdx {
>>          enum tdx_prepare_switch_state prep_switch_state;
>>          u64 msr_host_kernel_gs_base;
>> +
>> +        u64 map_gpa_next;
>> +        u64 map_gpa_end;
>>  };
>>  void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
>
On 12/16/2024 9:08 AM, Binbin Wu wrote:
>
>
>
> On 12/13/2024 5:32 PM, Xiaoyao Li wrote:
>> On 12/1/2024 11:53 AM, Binbin Wu wrote:
>>
> [...]
>>> +
>>> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>>> +{
>>> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
>>> +        u64 gpa = tdvmcall_a0_read(vcpu);
>>
>> We can use kvm_r12_read() directly, which is more intuitive. And we
>> can drop the MACRO for a0/a1/a2/a3 accessors in patch 022.
> I am neutral about it.
>

a0, a1, a2, a3 are the naming convention for KVM's hypercalls. It makes
sense when they serve as the parameters to __kvm_emulate_hypercall().

However, now __kvm_emulate_hypercall() is made into a MACRO and we don't
need temp variables like a0 = kvm_xx_read().

For TDVMCALL leafs other than normal KVM hypercalls, they are all TDX
specific and defined in the TDX GHCI spec, so a0/a1/a2/a3 make no sense
for them.

>
>>
>>> +        u64 size = tdvmcall_a1_read(vcpu);
>>> +        u64 ret;
>>> +
>>> +        /*
>>> +         * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
>>> +         * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
>>> +         * bit set. If not, the error code is not defined in GHCI for TDX, use
>>> +         * TDVMCALL_STATUS_INVALID_OPERAND for this case.
>>> +         */
>>> +        if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
>>> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>>> +                goto error;
>>> +        }
>>> +
>>> +        if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
>>> +            !kvm_vcpu_is_legal_gpa(vcpu, gpa + size -1) ||
>>> +            (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
>>> +             vt_is_tdx_private_gpa(vcpu->kvm, gpa + size -1))) {
>>> +                ret = TDVMCALL_STATUS_INVALID_OPERAND;
>>> +                goto error;
>>> +        }
>>> +
>>> +        if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
>>> +                ret = TDVMCALL_STATUS_ALIGN_ERROR;
>>> +                goto error;
>>> +        }
>>> +
>>> +        tdx->map_gpa_end = gpa + size;
>>> +        tdx->map_gpa_next = gpa;
>>> +
>>> +        __tdx_map_gpa(tdx);
>>> +        /* Forward request to userspace. */
>>> +        return 0;
>>
>> Maybe let __tdx_map_gpa() return an int to decide whether it needs to
>> return to userspace and
>>
>>         return __tdx_map_gpa(tdx);
>>
>> ?
>
> To save one line of code and the comment?

No. It's just that I found most of the cases that need to exit to
userspace come with "return 0" after setting up run->exit_reason and
run->(fields).

> Because MapGPA always goes to userspace, I don't want to make a function
> return an int that is a fixed value.
> And if the multiple comments bother you, I think the comments can be
> removed.
>
>>
>>
>>> +
>>> +error:
>>> +        tdvmcall_set_return_code(vcpu, ret);
>>> +        kvm_r11_write(vcpu, gpa);
>>> +        return 1;
>>> +}
>>> +
>>> static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>>> {
>>>         if (tdvmcall_exit_type(vcpu))
>>>                 return tdx_emulate_vmcall(vcpu);
>>>
>>>         switch (tdvmcall_leaf(vcpu)) {
>>> +        case TDVMCALL_MAP_GPA:
>>> +                return tdx_map_gpa(vcpu);
>>>         default:
>>>                 break;
>>>         }
>>> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>>> index 1abc94b046a0..bfae70887695 100644
>>> --- a/arch/x86/kvm/vmx/tdx.h
>>> +++ b/arch/x86/kvm/vmx/tdx.h
>>> @@ -71,6 +71,9 @@ struct vcpu_tdx {
>>>
>>>         enum tdx_prepare_switch_state prep_switch_state;
>>>         u64 msr_host_kernel_gs_base;
>>> +
>>> +        u64 map_gpa_next;
>>> +        u64 map_gpa_end;
>>> };
>>>
>>> void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
>>
>
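For concreteness, the alternative Xiaoyao suggests would look roughly like the sketch below. This is not code from the patch, just the proposed shape; the body mirrors the posted __tdx_map_gpa(), with only the return type and final statement changed. Binbin's objection is that the returned value is always 0:

```c
/*
 * Sketch of the suggested variant: __tdx_map_gpa() returns the
 * "exit to userspace" result directly, so callers can simply do
 * "return __tdx_map_gpa(tdx);" without a separate comment line.
 */
static int __tdx_map_gpa(struct vcpu_tdx *tdx)
{
        u64 gpa = tdx->map_gpa_next;
        u64 size = tdx->map_gpa_end - tdx->map_gpa_next;

        if (size > TDX_MAP_GPA_MAX_LEN)
                size = TDX_MAP_GPA_MAX_LEN;

        tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
        tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
        tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
        tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
        tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
                                           KVM_MAP_GPA_RANGE_ENCRYPTED :
                                           KVM_MAP_GPA_RANGE_DECRYPTED;
        tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
        tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;

        /* MapGPA always exits to userspace, so this is a fixed 0. */
        return 0;
}
```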
On 12/16/2024 2:03 PM, Xiaoyao Li wrote:
> On 12/16/2024 9:08 AM, Binbin Wu wrote:
>>
>>
>>
>> On 12/13/2024 5:32 PM, Xiaoyao Li wrote:
>>> On 12/1/2024 11:53 AM, Binbin Wu wrote:
>>>
>> [...]
>>>> +
>>>> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
>>>> +        u64 gpa = tdvmcall_a0_read(vcpu);
>>>
>>> We can use kvm_r12_read() directly, which is more intuitive. And we
>>> can drop the MACRO for a0/a1/a2/a3 accessors in patch 022.
>> I am neutral about it.
>>
>
> a0, a1, a2, a3 are the naming convention for KVM's hypercalls. It makes
> sense when they serve as the parameters to __kvm_emulate_hypercall().
>
> However, now __kvm_emulate_hypercall() is made into a MACRO and we don't
> need temp variables like a0 = kvm_xx_read().
>
> For TDVMCALL leafs other than normal KVM hypercalls, they are all TDX
> specific and defined in the TDX GHCI spec, so a0/a1/a2/a3 make no sense
> for them.

OK, makes sense.

Thanks!
On 12/18/2024 9:38 AM, Binbin Wu wrote:
>
>
>
> On 12/16/2024 2:03 PM, Xiaoyao Li wrote:
>> On 12/16/2024 9:08 AM, Binbin Wu wrote:
>>>
>>>
>>>
>>> On 12/13/2024 5:32 PM, Xiaoyao Li wrote:
>>>> On 12/1/2024 11:53 AM, Binbin Wu wrote:
>>>>
>>> [...]
>>>>> +
>>>>> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +        struct vcpu_tdx * tdx = to_tdx(vcpu);
>>>>> +        u64 gpa = tdvmcall_a0_read(vcpu);
>>>>
>>>> We can use kvm_r12_read() directly, which is more intuitive. And we
>>>> can drop the MACRO for a0/a1/a2/a3 accessors in patch 022.
>>> I am neutral about it.
>>>
>>
>> a0, a1, a2, a3 are the naming convention for KVM's hypercalls. It makes
>> sense when they serve as the parameters to __kvm_emulate_hypercall().
>>
>> However, now __kvm_emulate_hypercall() is made into a MACRO and we don't
>> need temp variables like a0 = kvm_xx_read().
>>
>> For TDVMCALL leafs other than normal KVM hypercalls, they are all TDX
>> specific and defined in the TDX GHCI spec, so a0/a1/a2/a3 make no sense
>> for them.
> OK, makes sense.
>
> Thanks!
>

I will leave it as it is for now. There is an RFC proposal from Hunter
about using descriptively named parameters:
https://lore.kernel.org/all/fa817f29-e3ba-4c54-8600-e28cf6ab1953@intel.com/
It can wait until there is a final conclusion.
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 620327f0161f..a602d081cf1c 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -32,6 +32,7 @@
 #define TDVMCALL_STATUS_SUCCESS         0x0000000000000000ULL
 #define TDVMCALL_STATUS_RETRY           0x0000000000000001ULL
 #define TDVMCALL_STATUS_INVALID_OPERAND 0x8000000000000000ULL
+#define TDVMCALL_STATUS_ALIGN_ERROR     0x8000000000000002ULL
 
 /*
  * Bitmasks of exposed registers (with VMM).
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4cc55b120ab0..553f4cbe0693 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -985,12 +985,122 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
         return r > 0;
 }
 
+/*
+ * Split into chunks and check interrupt pending between chunks. This allows
+ * for timely injection of interrupts to prevent issues with guest lockup
+ * detection.
+ */
+#define TDX_MAP_GPA_MAX_LEN (2 * 1024 * 1024)
+static void __tdx_map_gpa(struct vcpu_tdx * tdx);
+
+static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
+{
+        struct vcpu_tdx * tdx = to_tdx(vcpu);
+
+        if(vcpu->run->hypercall.ret) {
+                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+                kvm_r11_write(vcpu, tdx->map_gpa_next);
+                return 1;
+        }
+
+        tdx->map_gpa_next += TDX_MAP_GPA_MAX_LEN;
+        if (tdx->map_gpa_next >= tdx->map_gpa_end) {
+                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_SUCCESS);
+                return 1;
+        }
+
+        /*
+         * Stop processing the remaining part if there is pending interrupt.
+         * Skip checking pending virtual interrupt (reflected by
+         * TDX_VCPU_STATE_DETAILS_INTR_PENDING bit) to save a seamcall because
+         * if guest disabled interrupt, it's OK not returning back to guest
+         * due to non-NMI interrupt. Also it's rare to TDVMCALL_MAP_GPA
+         * immediately after STI or MOV/POP SS.
+         */
+        if (pi_has_pending_interrupt(vcpu) ||
+            kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
+                tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
+                kvm_r11_write(vcpu, tdx->map_gpa_next);
+                return 1;
+        }
+
+        __tdx_map_gpa(tdx);
+        /* Forward request to userspace. */
+        return 0;
+}
+
+static void __tdx_map_gpa(struct vcpu_tdx * tdx)
+{
+        u64 gpa = tdx->map_gpa_next;
+        u64 size = tdx->map_gpa_end - tdx->map_gpa_next;
+
+        if(size > TDX_MAP_GPA_MAX_LEN)
+                size = TDX_MAP_GPA_MAX_LEN;
+
+        tdx->vcpu.run->exit_reason = KVM_EXIT_HYPERCALL;
+        tdx->vcpu.run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
+        tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
+        tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
+        tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
+                                           KVM_MAP_GPA_RANGE_ENCRYPTED :
+                                           KVM_MAP_GPA_RANGE_DECRYPTED;
+        tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
+
+        tdx->vcpu.arch.complete_userspace_io = tdx_complete_vmcall_map_gpa;
+}
+
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+        struct vcpu_tdx * tdx = to_tdx(vcpu);
+        u64 gpa = tdvmcall_a0_read(vcpu);
+        u64 size = tdvmcall_a1_read(vcpu);
+        u64 ret;
+
+        /*
+         * Converting TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE requires
+         * userspace to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
+         * bit set. If not, the error code is not defined in GHCI for TDX, use
+         * TDVMCALL_STATUS_INVALID_OPERAND for this case.
+         */
+        if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
+                ret = TDVMCALL_STATUS_INVALID_OPERAND;
+                goto error;
+        }
+
+        if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
+            !kvm_vcpu_is_legal_gpa(vcpu, gpa + size -1) ||
+            (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
+             vt_is_tdx_private_gpa(vcpu->kvm, gpa + size -1))) {
+                ret = TDVMCALL_STATUS_INVALID_OPERAND;
+                goto error;
+        }
+
+        if (!PAGE_ALIGNED(gpa) || !PAGE_ALIGNED(size)) {
+                ret = TDVMCALL_STATUS_ALIGN_ERROR;
+                goto error;
+        }
+
+        tdx->map_gpa_end = gpa + size;
+        tdx->map_gpa_next = gpa;
+
+        __tdx_map_gpa(tdx);
+        /* Forward request to userspace. */
+        return 0;
+
+error:
+        tdvmcall_set_return_code(vcpu, ret);
+        kvm_r11_write(vcpu, gpa);
+        return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
         if (tdvmcall_exit_type(vcpu))
                 return tdx_emulate_vmcall(vcpu);
 
         switch (tdvmcall_leaf(vcpu)) {
+        case TDVMCALL_MAP_GPA:
+                return tdx_map_gpa(vcpu);
         default:
                 break;
         }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 1abc94b046a0..bfae70887695 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -71,6 +71,9 @@ struct vcpu_tdx {
 
         enum tdx_prepare_switch_state prep_switch_state;
         u64 msr_host_kernel_gs_base;
+
+        u64 map_gpa_next;
+        u64 map_gpa_end;
 };
 
 void tdh_vp_rd_failed(struct vcpu_tdx *tdx, char *uclass, u32 field, u64 err);
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.

MapGPA is used by a TDX guest to request mapping a GPA range as private
or shared memory. It needs to exit to userspace for handling. KVM has
already implemented a similar hypercall, KVM_HC_MAP_GPA_RANGE, which will
exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks,
convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request
to userspace.

To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
range is split into 2MB chunks, and pending interrupts are checked between
chunks. This allows for timely injection of interrupts and prevents issues
with guest lockup detection. The TDX guest should retry the operation for
the GPA starting at the address specified in R11 when the TDVMCALL returns
TDVMCALL_STATUS_RETRY as the status code.

Note that userspace needs to enable KVM_CAP_EXIT_HYPERCALL with the
KVM_HC_MAP_GPA_RANGE bit set for a TD VM.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
Hypercalls exit to userspace breakout:
- New added.
  Implement one of the hypercalls that need to exit to userspace for
  handling, after dropping "KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL",
  which tries to resolve Sean's comment.
  https://lore.kernel.org/kvm/Zg18ul8Q4PGQMWam@google.com/
- Check interrupt pending between chunks, as suggested by Sean.
  https://lore.kernel.org/kvm/ZleJvmCawKqmpFIa@google.com/
- Use TDVMCALL_STATUS prefix for TDX call status codes (Binbin)
- Use vt_is_tdx_private_gpa()
---
 arch/x86/include/asm/shared/tdx.h |   1 +
 arch/x86/kvm/vmx/tdx.c            | 110 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h            |   3 +
 3 files changed, 114 insertions(+)
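To illustrate the userspace requirement in the last note, a minimal sketch of the VMM side. This is not code from this series; the conversion policy and error handling are VMM-specific, and calling KVM_SET_MEMORY_ATTRIBUTES is just one plausible way to service the request:

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/kvm_para.h>   /* KVM_HC_MAP_GPA_RANGE, KVM_MAP_GPA_RANGE_* */

/* Opt in to KVM_EXIT_HYPERCALL for KVM_HC_MAP_GPA_RANGE on the VM fd. */
static int enable_map_gpa_exit(int vm_fd)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_EXIT_HYPERCALL,
                .args[0] = 1ULL << KVM_HC_MAP_GPA_RANGE,
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/* Handle one KVM_EXIT_HYPERCALL exit for a MapGPA chunk. */
static void handle_map_gpa_exit(int vm_fd, struct kvm_run *run)
{
        struct kvm_memory_attributes attrs = {
                .address = run->hypercall.args[0],
                .size = run->hypercall.args[1] * 4096, /* args[1] is npages */
                .attributes = (run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED) ?
                              KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
        };

        /* A nonzero ret makes KVM fail this chunk of the TDVMCALL. */
        run->hypercall.ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) ? 1 : 0;
}
```

Since KVM re-enters tdx_complete_vmcall_map_gpa() after each chunk, userspace sees at most one 2MB conversion per exit and needs no MapGPA-specific state of its own, which is what makes the TDVMCALL_STATUS_RETRY scheme above migration-friendly.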