
[v3,1/2] KVM: x86: Check hypercall's exit to userspace generically

Message ID 20240826022255.361406-2-binbin.wu@linux.intel.com (mailing list archive)
State New, archived
Series KVM: x86: Check hypercall's exit to userspace generically

Commit Message

Binbin Wu Aug. 26, 2024, 2:22 a.m. UTC
Check whether a KVM hypercall needs to exit to userspace based on the
hypercall_exit_enabled field of struct kvm_arch.

Userspace can request that a hypercall exit to userspace for handling by
enabling KVM_CAP_EXIT_HYPERCALL, and each enabled hypercall is recorded as
a bit in hypercall_exit_enabled.  Make the check generic based on that field.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/kvm/x86.c | 5 +++--
 arch/x86/kvm/x86.h | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)
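
As background, KVM_CAP_EXIT_HYPERCALL is a per-VM capability whose argument is
a bitmask of hypercall numbers to forward to userspace.  A minimal userspace
sketch of enabling the exit for KVM_HC_MAP_GPA_RANGE (vm_fd is assumed to be an
already-created VM file descriptor):

#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/kvm_para.h>

/* Sketch only, not from the series: forward MAP_GPA_RANGE to userspace. */
static int enable_hypercall_exit(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_EXIT_HYPERCALL,
		/* args[0] is the bitmask of hypercall numbers to exit on. */
		.args[0] = 1ULL << KVM_HC_MAP_GPA_RANGE,
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}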

Comments

Xiaoyao Li Oct. 9, 2024, 6:48 a.m. UTC | #1
On 8/26/2024 10:22 AM, Binbin Wu wrote:
> Check whether a KVM hypercall needs to exit to userspace based on the
> hypercall_exit_enabled field of struct kvm_arch.
> 
> Userspace can request that a hypercall exit to userspace for handling by
> enabling KVM_CAP_EXIT_HYPERCALL, and each enabled hypercall is recorded as
> a bit in hypercall_exit_enabled.  Make the check generic based on that field.
> 
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> ---
>   arch/x86/kvm/x86.c | 5 +++--
>   arch/x86/kvm/x86.h | 4 ++++
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 966fb301d44b..e521f14ad2b2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10220,8 +10220,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   	cpl = kvm_x86_call(get_cpl)(vcpu);
>   
>   	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> -		/* MAP_GPA tosses the request to the user space. */
> +	/* Check !ret first to make sure nr is a valid KVM hypercall. */
> +	if (!ret && user_exit_on_hypercall(vcpu->kvm, nr))
> +		/* The hypercall is requested to exit to userspace. */

Nit: The above comment is unnecessary since the name of
user_exit_on_hypercall() is self-documenting.

Otherwise,

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

>   		return 0;
>   
>   	if (!op_64_bit)
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 6556a43f1915..bc1a9e080acb 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -561,4 +561,8 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
>   			 unsigned int port, void *data,  unsigned int count,
>   			 int in);
>   
> +static inline bool user_exit_on_hypercall(struct kvm *kvm, unsigned long hc_nr)
> +{
> +	return kvm->arch.hypercall_exit_enabled & BIT(hc_nr);
> +}
>   #endif
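
For reference, the bits consumed by user_exit_on_hypercall() are populated when
userspace enables the capability; roughly (paraphrasing the existing
KVM_CAP_EXIT_HYPERCALL handling in kvm_vm_ioctl_enable_cap()):

	case KVM_CAP_EXIT_HYPERCALL:
		if (cap->args[0] & ~KVM_EXIT_HYPERCALL_VALID_MASK) {
			r = -EINVAL;
			break;
		}
		kvm->arch.hypercall_exit_enabled = cap->args[0];
		r = 0;
		break;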
Sean Christopherson Oct. 30, 2024, 8:49 p.m. UTC | #2
On Mon, Aug 26, 2024, Binbin Wu wrote:
> Check whether a KVM hypercall needs to exit to userspace based on the
> hypercall_exit_enabled field of struct kvm_arch.
> 
> Userspace can request that a hypercall exit to userspace for handling by
> enabling KVM_CAP_EXIT_HYPERCALL, and each enabled hypercall is recorded as
> a bit in hypercall_exit_enabled.  Make the check generic based on that field.
> 
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> ---
>  arch/x86/kvm/x86.c | 5 +++--
>  arch/x86/kvm/x86.h | 4 ++++
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 966fb301d44b..e521f14ad2b2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10220,8 +10220,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  	cpl = kvm_x86_call(get_cpl)(vcpu);
>  
>  	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> -		/* MAP_GPA tosses the request to the user space. */
> +	/* Check !ret first to make sure nr is a valid KVM hypercall. */
> +	if (!ret && user_exit_on_hypercall(vcpu->kvm, nr))

I don't love that the caller has to re-check for user_exit_on_hypercall().  I
also don't love that there's a surprising number of checks lurking in
__kvm_emulate_hypercall(), e.g. that CPL==0, especially since the above comment
about "a valid KVM hypercall" can be interpreted as meaning KVM is *only* checking
if the hypercall number is valid.

E.g. my initial reaction was that we could add a separate path for userspace
hypercalls, but that would be subtly wrong.  And my second reaction was to hoist
the common checks out of __kvm_emulate_hypercall(), but then I remembered that
the only reason __kvm_emulate_hypercall() is separate is to allow it to be called
by TDX with different source/destination registers.

So, I'm strongly leaning towards dropping the above change, squashing the addition
of the helper with patch 2, and then landing this on top.

Thoughts?

--
Subject: [PATCH] KVM: x86: Use '0' in __kvm_emulate_hypercall() to signal
 "exit to userspace"

Rework __kvm_emulate_hypercall() to use '0' to indicate an exit to
userspace instead of relying on the caller to manually check for success
*and* if user_exit_on_hypercall() is true.  Use '1' for "success" to
(mostly) align with KVM's de facto return codes, where '0' == exit to
userspace, '1' == resume guest, and -errno == failure.  Unfortunately,
some of the PV error codes returned to the guest are positive values, so
the pattern doesn't exactly match KVM's "standard", but it should be close
enough to be intuitive for KVM readers.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e09daa3b157c..5fdeb58221e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10024,7 +10024,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 
 	switch (nr) {
 	case KVM_HC_VAPIC_POLL_IRQ:
-		ret = 0;
+		ret = 1;
 		break;
 	case KVM_HC_KICK_CPU:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT))
@@ -10032,7 +10032,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 
 		kvm_pv_kick_cpu_op(vcpu->kvm, a1);
 		kvm_sched_yield(vcpu, a1);
-		ret = 0;
+		ret = 1;
 		break;
 #ifdef CONFIG_X86_64
 	case KVM_HC_CLOCK_PAIRING:
@@ -10050,7 +10050,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 			break;
 
 		kvm_sched_yield(vcpu, a0);
-		ret = 0;
+		ret = 1;
 		break;
 	case KVM_HC_MAP_GPA_RANGE: {
 		u64 gpa = a0, npages = a1, attrs = a2;
@@ -10111,12 +10111,21 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	cpl = kvm_x86_call(get_cpl)(vcpu);
 
 	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
-	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
-		/* MAP_GPA tosses the request to the user space. */
+	if (!ret)
 		return 0;
 
-	if (!op_64_bit)
+	/*
+	 * KVM's ABI with the guest is that '0' is success, and any other value
+	 * is an error code.  Internally, '0' == exit to userspace (see above)
+	 * and '1' == success, as KVM's de facto standard return codes are that
+	 * plus -errno == failure.  Explicitly check for '1' as some PV error
+	 * codes are positive values.
+	 */
+	if (ret == 1)
+		ret = 0;
+	else if (!op_64_bit)
 		ret = (u32)ret;
+
 	kvm_rax_write(vcpu, ret);
 
 	return kvm_skip_emulated_instruction(vcpu);

base-commit: 675248928970d33f7fc8ca9851a170c98f4f1c4f
--
Kai Huang Oct. 31, 2024, 12:49 a.m. UTC | #3
> @@ -10024,7 +10024,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  
>  	switch (nr) {
>  	case KVM_HC_VAPIC_POLL_IRQ:
> -		ret = 0;
> +		ret = 1;
>  		break;
>  	case KVM_HC_KICK_CPU:
>  		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT))
> @@ -10032,7 +10032,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  
>  		kvm_pv_kick_cpu_op(vcpu->kvm, a1);
>  		kvm_sched_yield(vcpu, a1);
> -		ret = 0;
> +		ret = 1;
>  		break;
>  #ifdef CONFIG_X86_64
>  	case KVM_HC_CLOCK_PAIRING:
> @@ -10050,7 +10050,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  			break;
>  
>  		kvm_sched_yield(vcpu, a0);
> -		ret = 0;
> +		ret = 1;
>  		break;
>  	case KVM_HC_MAP_GPA_RANGE: {
>  		u64 gpa = a0, npages = a1, attrs = a2;
> @@ -10111,12 +10111,21 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  	cpl = kvm_x86_call(get_cpl)(vcpu);
>  
>  	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> -		/* MAP_GPA tosses the request to the user space. */
> +	if (!ret)
>  		return 0;
>  
> -	if (!op_64_bit)
> +	/*
> +	 * KVM's ABI with the guest is that '0' is success, and any other value
> +	 * is an error code.  Internally, '0' == exit to userspace (see above)
> +	 * and '1' == success, as KVM's de facto standard return codes are that
> +	 * plus -errno == failure.  Explicitly check for '1' as some PV error
> +	 * codes are positive values.
> +	 */
> +	if (ret == 1)
> +		ret = 0;
> +	else if (!op_64_bit)
>  		ret = (u32)ret;
> +
>  	kvm_rax_write(vcpu, ret);
>  
>  	return kvm_skip_emulated_instruction(vcpu);
> 

It appears the two cases below are not covered correctly?

#ifdef CONFIG_X86_64    
        case KVM_HC_CLOCK_PAIRING:
                ret = kvm_pv_clock_pairing(vcpu, a0, a1);
                break;
#endif
        case KVM_HC_SEND_IPI:
                if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SEND_IPI))
                        break;

                ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
                break; 

Looking at the code, it seems at least for KVM_HC_CLOCK_PAIRING,
kvm_pv_clock_pairing() returns 0 on success, and the upstream behaviour is not
routing to userspace but writing 0 to vcpu's RAX and returning to the guest.
With the change above, it immediately returns to userspace w/o writing 0 to RAX.

For KVM_HC_SEND_IPI, it seems the upstream behaviour is that the return value
of kvm_pv_send_ipi() is always written to vcpu's RAX reg, and we always just go
back to the guest.  With your change, the behaviour will be changed to:

  1) when ret == 0, exit to userspace w/o writing 0 to vcpu's RAX,
  2) when ret == 1, it is converted to 0 and then written to RAX.

This doesn't look safe.
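
To make the collision concrete, a sketch (not code from the patch) of the
KVM_HC_CLOCK_PAIRING flow under the proposed scheme:

	/* In __kvm_emulate_hypercall(), unchanged by the patch: */
	case KVM_HC_CLOCK_PAIRING:
		ret = kvm_pv_clock_pairing(vcpu, a0, a1);	/* 0 on success */
		break;

	/* In kvm_emulate_hypercall(), with the patch applied: */
	if (!ret)
		return 0;	/* success is misread as "exit to userspace" */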
Binbin Wu Oct. 31, 2024, 4:56 a.m. UTC | #4
On 10/31/2024 4:49 AM, Sean Christopherson wrote:
> On Mon, Aug 26, 2024, Binbin Wu wrote:
>> Check whether a KVM hypercall needs to exit to userspace based on the
>> hypercall_exit_enabled field of struct kvm_arch.
>>
>> Userspace can request that a hypercall exit to userspace for handling by
>> enabling KVM_CAP_EXIT_HYPERCALL, and each enabled hypercall is recorded as
>> a bit in hypercall_exit_enabled.  Make the check generic based on that field.
>>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Reviewed-by: Kai Huang <kai.huang@intel.com>
>> ---
>>   arch/x86/kvm/x86.c | 5 +++--
>>   arch/x86/kvm/x86.h | 4 ++++
>>   2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 966fb301d44b..e521f14ad2b2 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10220,8 +10220,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>   	cpl = kvm_x86_call(get_cpl)(vcpu);
>>   
>>   	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
>> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
>> -		/* MAP_GPA tosses the request to the user space. */
>> +	/* Check !ret first to make sure nr is a valid KVM hypercall. */
>> +	if (!ret && user_exit_on_hypercall(vcpu->kvm, nr))
> I don't love that the caller has to re-check for user_exit_on_hypercall().
Agree, it is not ideal.

But if __kvm_emulate_hypercall() returns 0 to indicate user exit and 1 to
indicate success, then the callers have to convert the return code to set the
return value for the guest.  E.g., TDX code also needs to do the conversion.

> I also don't love that there's a surprising number of checks lurking in
> __kvm_emulate_hypercall(), e.g. that CPL==0, especially since the above comment
> about "a valid KVM hypercall" can be interpreted as meaning KVM is *only* checking
> if the hypercall number is valid.
>
> E.g. my initial reaction was that we could add a separate path for userspace
> hypercalls, but that would be subtly wrong.  And my second reaction was to hoist
> the common checks out of __kvm_emulate_hypercall(), but then I remembered that
> the only reason __kvm_emulate_hypercall() is separate is to allow it to be called
> by TDX with different source/destination registers.
>
> So, I'm strongly leaning towards dropping the above change, squashing the addition
> of the helper with patch 2, and then landing this on top.
>
> Thoughts?
I have no strong preference and OK with the proposal below.

Just note there are some cases which don't get the return value right, as
pointed out by Kai in another thread.
https://lore.kernel.org/kvm/3f158732a66829faaeb527a94b8df78d6173befa.camel@intel.com/


>
> --
> Subject: [PATCH] KVM: x86: Use '0' in __kvm_emulate_hypercall() to signal
>   "exit to userspace"
>
> Rework __kvm_emulate_hypercall() to use '0' to indicate an exit to
> userspace instead of relying on the caller to manually check for success
> *and* if user_exit_on_hypercall() is true.  Use '1' for "success" to
> (mostly) align with KVM's de facto return codes, where '0' == exit to
> userspace, '1' == resume guest, and -errno == failure.  Unfortunately,
> some of the PV error codes returned to the guest are positive values, so
> the pattern doesn't exactly match KVM's "standard", but it should be close
> enough to be intuitive for KVM readers.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/x86.c | 21 +++++++++++++++------
>   1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e09daa3b157c..5fdeb58221e2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10024,7 +10024,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   
>   	switch (nr) {
>   	case KVM_HC_VAPIC_POLL_IRQ:
> -		ret = 0;
> +		ret = 1;
>   		break;
>   	case KVM_HC_KICK_CPU:
>   		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT))
> @@ -10032,7 +10032,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   
>   		kvm_pv_kick_cpu_op(vcpu->kvm, a1);
>   		kvm_sched_yield(vcpu, a1);
> -		ret = 0;
> +		ret = 1;
>   		break;
>   #ifdef CONFIG_X86_64
>   	case KVM_HC_CLOCK_PAIRING:
> @@ -10050,7 +10050,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   			break;
>   
>   		kvm_sched_yield(vcpu, a0);
> -		ret = 0;
> +		ret = 1;
>   		break;
>   	case KVM_HC_MAP_GPA_RANGE: {
>   		u64 gpa = a0, npages = a1, attrs = a2;
> @@ -10111,12 +10111,21 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   	cpl = kvm_x86_call(get_cpl)(vcpu);
>   
>   	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> -		/* MAP_GPA tosses the request to the user space. */
> +	if (!ret)
>   		return 0;
>   
> -	if (!op_64_bit)
> +	/*
> +	 * KVM's ABI with the guest is that '0' is success, and any other value
> +	 * is an error code.  Internally, '0' == exit to userspace (see above)
> +	 * and '1' == success, as KVM's de facto standard return codes are that
> +	 * plus -errno == failure.  Explicitly check for '1' as some PV error
> +	 * codes are positive values.
> +	 */
I didn't understand the last sentence:
"Explicitly check for '1' as some PV error codes are positive values."

The functions called in __kvm_emulate_hypercall() for PV features return
-KVM_EXXX for error code.
Did you mean that functions like kvm_pv_enable_async_pf(), which return
1 for error, would be called in __kvm_emulate_hypercall() in the future?
If that is the concern, then we cannot simply convert 1 to 0.

> +	if (ret == 1)
> +		ret = 0;
> +	else if (!op_64_bit)
>   		ret = (u32)ret;
> +
>   	kvm_rax_write(vcpu, ret);
>   
>   	return kvm_skip_emulated_instruction(vcpu);
>
> base-commit: 675248928970d33f7fc8ca9851a170c98f4f1c4f
Sean Christopherson Oct. 31, 2024, 2:54 p.m. UTC | #5
On Thu, Oct 31, 2024, Kai Huang wrote:
> > @@ -10111,12 +10111,21 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  	cpl = kvm_x86_call(get_cpl)(vcpu);
> >  
> >  	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > -		/* MAP_GPA tosses the request to the user space. */
> > +	if (!ret)
> >  		return 0;
> >  
> > -	if (!op_64_bit)
> > +	/*
> > +	 * KVM's ABI with the guest is that '0' is success, and any other value
> > +	 * is an error code.  Internally, '0' == exit to userspace (see above)
> > +	 * and '1' == success, as KVM's de facto standard return codes are that
> > +	 * plus -errno == failure.  Explicitly check for '1' as some PV error
> > +	 * codes are positive values.
> > +	 */
> > +	if (ret == 1)
> > +		ret = 0;
> > +	else if (!op_64_bit)
> >  		ret = (u32)ret;
> > +
> >  	kvm_rax_write(vcpu, ret);
> >  
> >  	return kvm_skip_emulated_instruction(vcpu);
> > 
> 
> It appears below two cases are not covered correctly?
> 
> #ifdef CONFIG_X86_64    
>         case KVM_HC_CLOCK_PAIRING:
>                 ret = kvm_pv_clock_pairing(vcpu, a0, a1);
>                 break;
> #endif
>         case KVM_HC_SEND_IPI:
>                 if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SEND_IPI))
>                         break;
> 
>                 ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
>                 break; 
> 
> Looking at the code, seems at least for KVM_HC_CLOCK_PAIRING,
> kvm_pv_clock_pairing() returns 0 on success, and the upstream behaviour is not
> routing to userspace but writing 0 to vcpu's RAX and returning to guest.  With
> the change above, it immediately returns to userspace w/o writing 0 to RAX.
> 
> For KVM_HC_SEND_IPI, seems the upstream behaviour is the return value of
> kvm_pv_send_ipi() is always written to vcpu's RAX reg, and we always just go
> back to guest.  With your change, the behaviour will be changed to:
> 
>   1) when ret == 0, exit to userspace w/o writing  0 to vcpu's RAX,
>   2) when ret == 1, it is converted to 0 and then written to RAX.
> 
> This doesn't look like safe.

Drat, I managed to space on the cases that didn't explicitly set '0'.  Hrm.

My other idea was to have an out-param to separate the return code intended for KVM
from the return code intended for the guest.  I generally dislike out-params, but
trying to juggle a return value that multiplexes guest and host values seems like
an even worse idea.

Also completely untested...

---
 arch/x86/include/asm/kvm_host.h |  8 +++----
 arch/x86/kvm/x86.c              | 41 +++++++++++++++------------------
 2 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..226df5c56811 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2179,10 +2179,10 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
 	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
 }
 
-unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
-				      unsigned long a0, unsigned long a1,
-				      unsigned long a2, unsigned long a3,
-				      int op_64_bit, int cpl);
+int __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+			    unsigned long a0, unsigned long a1,
+			    unsigned long a2, unsigned long a3,
+			    int op_64_bit, int cpl, unsigned long *ret);
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e09daa3b157c..e9ae09f1b45b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9998,13 +9998,11 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
-unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
-				      unsigned long a0, unsigned long a1,
-				      unsigned long a2, unsigned long a3,
-				      int op_64_bit, int cpl)
+int __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+			    unsigned long a0, unsigned long a1,
+			    unsigned long a2, unsigned long a3,
+			    int op_64_bit, int cpl, unsigned long *ret)
 {
-	unsigned long ret;
-
 	trace_kvm_hypercall(nr, a0, a1, a2, a3);
 
 	if (!op_64_bit) {
@@ -10016,15 +10014,15 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 	}
 
 	if (cpl) {
-		ret = -KVM_EPERM;
+		*ret = -KVM_EPERM;
 		goto out;
 	}
 
-	ret = -KVM_ENOSYS;
+	*ret = -KVM_ENOSYS;
 
 	switch (nr) {
 	case KVM_HC_VAPIC_POLL_IRQ:
-		ret = 0;
+		*ret = 0;
 		break;
 	case KVM_HC_KICK_CPU:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT))
@@ -10032,36 +10030,36 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 
 		kvm_pv_kick_cpu_op(vcpu->kvm, a1);
 		kvm_sched_yield(vcpu, a1);
-		ret = 0;
+		*ret = 0;
 		break;
 #ifdef CONFIG_X86_64
 	case KVM_HC_CLOCK_PAIRING:
-		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
+		*ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
 #endif
 	case KVM_HC_SEND_IPI:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SEND_IPI))
 			break;
 
-		ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
+		*ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
 		break;
 	case KVM_HC_SCHED_YIELD:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SCHED_YIELD))
 			break;
 
 		kvm_sched_yield(vcpu, a0);
-		ret = 0;
+		*ret = 0;
 		break;
 	case KVM_HC_MAP_GPA_RANGE: {
 		u64 gpa = a0, npages = a1, attrs = a2;
 
-		ret = -KVM_ENOSYS;
+		*ret = -KVM_ENOSYS;
 		if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE))
 			break;
 
 		if (!PAGE_ALIGNED(gpa) || !npages ||
 		    gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
-			ret = -KVM_EINVAL;
+			*ret = -KVM_EINVAL;
 			break;
 		}
 
@@ -10080,13 +10078,13 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 		return 0;
 	}
 	default:
-		ret = -KVM_ENOSYS;
+		*ret = -KVM_ENOSYS;
 		break;
 	}
 
 out:
 	++vcpu->stat.hypercalls;
-	return ret;
+	return 1;
 }
 EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
 
@@ -10094,7 +10092,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
 	unsigned long nr, a0, a1, a2, a3, ret;
 	int op_64_bit;
-	int cpl;
+	int cpl, r;
 
 	if (kvm_xen_hypercall_enabled(vcpu->kvm))
 		return kvm_xen_hypercall(vcpu);
@@ -10110,10 +10108,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	op_64_bit = is_64_bit_hypercall(vcpu);
 	cpl = kvm_x86_call(get_cpl)(vcpu);
 
-	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
-	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
-		/* MAP_GPA tosses the request to the user space. */
-		return 0;
+	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
+	if (r <= r)
+		return r;
 
 	if (!op_64_bit)
 		ret = (u32)ret;

base-commit: 675248928970d33f7fc8ca9851a170c98f4f1c4f
--
Sean Christopherson Oct. 31, 2024, 3:11 p.m. UTC | #6
On Thu, Oct 31, 2024, Binbin Wu wrote:
> On 10/31/2024 4:49 AM, Sean Christopherson wrote:
> > On Mon, Aug 26, 2024, Binbin Wu wrote:
> > > Check whether a KVM hypercall needs to exit to userspace based on the
> > > hypercall_exit_enabled field of struct kvm_arch.
> > > 
> > > Userspace can request that a hypercall exit to userspace for handling by
> > > enabling KVM_CAP_EXIT_HYPERCALL, and each enabled hypercall is recorded as
> > > a bit in hypercall_exit_enabled.  Make the check generic based on that field.
> > > 
> > > Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> > > Reviewed-by: Kai Huang <kai.huang@intel.com>
> > > ---
> > >   arch/x86/kvm/x86.c | 5 +++--
> > >   arch/x86/kvm/x86.h | 4 ++++
> > >   2 files changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 966fb301d44b..e521f14ad2b2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -10220,8 +10220,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > >   	cpl = kvm_x86_call(get_cpl)(vcpu);
> > >   	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > > -		/* MAP_GPA tosses the request to the user space. */
> > > +	/* Check !ret first to make sure nr is a valid KVM hypercall. */
> > > +	if (!ret && user_exit_on_hypercall(vcpu->kvm, nr))
> > I don't love that the caller has to re-check for user_exit_on_hypercall().
> Agree, it is not ideal.
> 
> But if __kvm_emulate_hypercall() returns 0 to indicate user exit and 1 to
> indicate success, then the callers have to convert the return code to set the
> return value for the guest.  E.g., TDX code also needs to do the conversion.

Yeah, it's ugly.

> > @@ -10111,12 +10111,21 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >   	cpl = kvm_x86_call(get_cpl)(vcpu);
> >   	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > -		/* MAP_GPA tosses the request to the user space. */
> > +	if (!ret)
> >   		return 0;
> > -	if (!op_64_bit)
> > +	/*
> > +	 * KVM's ABI with the guest is that '0' is success, and any other value
> > +	 * is an error code.  Internally, '0' == exit to userspace (see above)
> > +	 * and '1' == success, as KVM's de facto standard return codes are that
> > +	 * plus -errno == failure.  Explicitly check for '1' as some PV error
> > +	 * codes are positive values.
> > +	 */
> I didn't understand the last sentence:
> "Explicitly check for '1' as some PV error codes are positive values."
> 
> The functions called in __kvm_emulate_hypercall() for PV features return
> -KVM_EXXX for error code.
> Did you mean the functions like kvm_pv_enable_async_pf(), which return
> 1 for error, would be called in __kvm_emulate_hypercall() in the future?
> If this is the concern, then we cannot simply convert 1 to 0 then.

No, it was a brain fart on my end.  I was thinking of these #defines, and managed
to forget that KVM always returns -KVM_Exxx.  /facepalm

  #define KVM_ENOSYS		1000
  #define KVM_EFAULT		EFAULT
  #define KVM_EINVAL		EINVAL
  #define KVM_E2BIG		E2BIG
  #define KVM_EPERM		EPERM
  #define KVM_EOPNOTSUPP		95

> 
> > +	if (ret == 1)
> > +		ret = 0;
> > +	else if (!op_64_bit)
> >   		ret = (u32)ret;
> > +
> >   	kvm_rax_write(vcpu, ret);
> >   	return kvm_skip_emulated_instruction(vcpu);
> > 
> > base-commit: 675248928970d33f7fc8ca9851a170c98f4f1c4f
>
Binbin Wu Nov. 1, 2024, 2:25 a.m. UTC | #7
On 10/31/2024 10:54 PM, Sean Christopherson wrote:
> My other idea was to have an out-param to separate the return code intended for KVM
> from the return code intended for the guest.  I generally dislike out-params, but
> trying to juggle a return value that multiplexes guest and host values seems like
> an even worse idea.
>
> Also completely untested...
>
> ---
>   arch/x86/include/asm/kvm_host.h |  8 +++----
>   arch/x86/kvm/x86.c              | 41 +++++++++++++++------------------
>   2 files changed, 23 insertions(+), 26 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6d9f763a7bb9..226df5c56811 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2179,10 +2179,10 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>   	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>   }
>   
> -unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> -				      unsigned long a0, unsigned long a1,
> -				      unsigned long a2, unsigned long a3,
> -				      int op_64_bit, int cpl);
> +int __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> +			    unsigned long a0, unsigned long a1,
> +			    unsigned long a2, unsigned long a3,
> +			    int op_64_bit, int cpl, unsigned long *ret);
>   int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>   
>   int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e09daa3b157c..e9ae09f1b45b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9998,13 +9998,11 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>   	return kvm_skip_emulated_instruction(vcpu);
>   }
>   
> -unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> -				      unsigned long a0, unsigned long a1,
> -				      unsigned long a2, unsigned long a3,
> -				      int op_64_bit, int cpl)
> +int __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> +			    unsigned long a0, unsigned long a1,
> +			    unsigned long a2, unsigned long a3,
> +			    int op_64_bit, int cpl, unsigned long *ret)
>   {
> -	unsigned long ret;
> -
>   	trace_kvm_hypercall(nr, a0, a1, a2, a3);
>   
>   	if (!op_64_bit) {
> @@ -10016,15 +10014,15 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   	}
>   
>   	if (cpl) {
> -		ret = -KVM_EPERM;
> +		*ret = -KVM_EPERM;
>   		goto out;
>   	}
>   
> -	ret = -KVM_ENOSYS;
> +	*ret = -KVM_ENOSYS;
>   
>   	switch (nr) {
>   	case KVM_HC_VAPIC_POLL_IRQ:
> -		ret = 0;
> +		*ret = 0;
>   		break;
>   	case KVM_HC_KICK_CPU:
>   		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_UNHALT))
> @@ -10032,36 +10030,36 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   
>   		kvm_pv_kick_cpu_op(vcpu->kvm, a1);
>   		kvm_sched_yield(vcpu, a1);
> -		ret = 0;
> +		*ret = 0;
>   		break;
>   #ifdef CONFIG_X86_64
>   	case KVM_HC_CLOCK_PAIRING:
> -		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
> +		*ret = kvm_pv_clock_pairing(vcpu, a0, a1);
>   		break;
>   #endif
>   	case KVM_HC_SEND_IPI:
>   		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SEND_IPI))
>   			break;
>   
> -		ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
> +		*ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
>   		break;
>   	case KVM_HC_SCHED_YIELD:
>   		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SCHED_YIELD))
>   			break;
>   
>   		kvm_sched_yield(vcpu, a0);
> -		ret = 0;
> +		*ret = 0;
>   		break;
>   	case KVM_HC_MAP_GPA_RANGE: {
>   		u64 gpa = a0, npages = a1, attrs = a2;
>   
> -		ret = -KVM_ENOSYS;
> +		*ret = -KVM_ENOSYS;
>   		if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE))
>   			break;
>   
>   		if (!PAGE_ALIGNED(gpa) || !npages ||
>   		    gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
> -			ret = -KVM_EINVAL;
> +			*ret = -KVM_EINVAL;
>   			break;
>   		}

*ret needs to be set to 0 for this case before returning 0 to the caller?

>   
> @@ -10080,13 +10078,13 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   		return 0;
>   	}
>   	default:
> -		ret = -KVM_ENOSYS;
> +		*ret = -KVM_ENOSYS;
>   		break;
>   	}
>   
>   out:
>   	++vcpu->stat.hypercalls;
> -	return ret;
> +	return 1;
>   }
>   EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
>   
> @@ -10094,7 +10092,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   {
>   	unsigned long nr, a0, a1, a2, a3, ret;
>   	int op_64_bit;
> -	int cpl;
> +	int cpl, r;
>   
>   	if (kvm_xen_hypercall_enabled(vcpu->kvm))
>   		return kvm_xen_hypercall(vcpu);
> @@ -10110,10 +10108,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   	op_64_bit = is_64_bit_hypercall(vcpu);
>   	cpl = kvm_x86_call(get_cpl)(vcpu);
>   
> -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> -		/* MAP_GPA tosses the request to the user space. */
> -		return 0;
> +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
> +	if (r <= r)
A typo here.
I guess it meant to be "if (r <= ret)" ?

So the combinations will be
----------------------------------------------------------------------------
    |  r  |    ret    | r <= ret |
---|-----|-----------|----------|-------------------------------------------
  1 |  0  |     0     |   true   |  return r, which is 0, exit to userspace
---|-----|-----------|----------|-------------------------------------------
  2 |  1  |     0     |   false  |  set vcpu's RAX and return back to guest
---|-----|-----------|----------|-------------------------------------------
  3 |  1  | -KVM_Exxx |   false  |  set vcpu's RAX and return back to guest
---|-----|-----------|----------|-------------------------------------------
  4 |  1  |  Positive |   true   |  return r, which is 1,
    |     |     N     |          |  back to guest without setting vcpu's RAX
----------------------------------------------------------------------------

KVM_HC_SEND_IPI, which calls kvm_pv_send_ipi(), can hit case 4, which will
go back to the guest without setting RAX.  That differs from the current behavior.

r can be 0 only if there is no other error detected during pre-checks.
I think it can just check whether r is 0 or not.
I.e.,

@@ -10094,7 +10092,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
  {
      unsigned long nr, a0, a1, a2, a3, ret;
      int op_64_bit;
-    int cpl;
+    int cpl, r;

      if (kvm_xen_hypercall_enabled(vcpu->kvm))
          return kvm_xen_hypercall(vcpu);
@@ -10110,10 +10108,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
      op_64_bit = is_64_bit_hypercall(vcpu);
      cpl = kvm_x86_call(get_cpl)(vcpu);

-    ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
-    if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
-        /* MAP_GPA tosses the request to the user space. */
-        return 0;
+    r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
+    if (!r)
+        return 0;

      if (!op_64_bit)
          ret = (u32)ret;


> +		return r;
>   
>   	if (!op_64_bit)
>   		ret = (u32)ret;
>
> base-commit: 675248928970d33f7fc8ca9851a170c98f4f1c4f
Kai Huang Nov. 1, 2024, 11:05 a.m. UTC | #8
On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:
> On Thu, Oct 31, 2024, Kai Huang wrote:
> > > @@ -10111,12 +10111,21 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > >  	cpl = kvm_x86_call(get_cpl)(vcpu);
> > >  
> > >  	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > > -		/* MAP_GPA tosses the request to the user space. */
> > > +	if (!ret)
> > >  		return 0;
> > >  
> > > -	if (!op_64_bit)
> > > +	/*
> > > +	 * KVM's ABI with the guest is that '0' is success, and any other value
> > > +	 * is an error code.  Internally, '0' == exit to userspace (see above)
> > > +	 * and '1' == success, as KVM's de facto standard return codes are that
> > > +	 * plus -errno == failure.  Explicitly check for '1' as some PV error
> > > +	 * codes are positive values.
> > > +	 */
> > > +	if (ret == 1)
> > > +		ret = 0;
> > > +	else if (!op_64_bit)
> > >  		ret = (u32)ret;
> > > +
> > >  	kvm_rax_write(vcpu, ret);
> > >  
> > >  	return kvm_skip_emulated_instruction(vcpu);
> > > 
> > 
> > It appears the two cases below are not covered correctly?
> > 
> > #ifdef CONFIG_X86_64    
> >         case KVM_HC_CLOCK_PAIRING:
> >                 ret = kvm_pv_clock_pairing(vcpu, a0, a1);
> >                 break;
> > #endif
> >         case KVM_HC_SEND_IPI:
> >                 if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SEND_IPI))
> >                         break;
> > 
> >                 ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
> >                 break; 
> > 
> > Looking at the code, it seems at least for KVM_HC_CLOCK_PAIRING,
> > kvm_pv_clock_pairing() returns 0 on success, and the upstream behaviour is not
> > routing to userspace but writing 0 to vcpu's RAX and returning to the guest.
> > With the change above, it immediately returns to userspace w/o writing 0 to RAX.
> > 
> > For KVM_HC_SEND_IPI, it seems the upstream behaviour is that the return value
> > of kvm_pv_send_ipi() is always written to vcpu's RAX reg, and we always just go
> > back to the guest.  With your change, the behaviour will be changed to:
> > 
> >   1) when ret == 0, exit to userspace w/o writing 0 to vcpu's RAX,
> >   2) when ret == 1, it is converted to 0 and then written to RAX.
> > 
> > This doesn't look safe.
> 
> Drat, I managed to space on the cases that didn't explicitly set '0'.  Hrm.
> 
> My other idea was to have an out-param to separate the return code intended for KVM
> from the return code intended for the guest.  I generally dislike out-params, but
> trying to juggle a return value that multiplexes guest and host values seems like
> an even worse idea.

Yeah, this looks fine to me; one comment below ...


[...]

> -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> -		/* MAP_GPA tosses the request to the user space. */
> -		return 0;
> +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
> +	if (r <= r)
> +		return r;

... should be:

	if (r <= 0)
		return r;

?

Another option might be to move the "set hypercall return value" code inside
__kvm_emulate_hypercall().  IIUC the reason to split
__kvm_emulate_hypercall() out is for TDX, and while non-TDX uses RAX to carry
the hypercall return value, TDX uses R10.

We can additionally pass a "kvm_hypercall_set_ret_func" function pointer to
__kvm_emulate_hypercall(), and invoke it inside.  Then we can change
__kvm_emulate_hypercall() to return: 
    < 0 error, 
    ==0 return to userspace, 
    > 0 go back to guest.

I made an attempt below, build-tested only:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..c48feb62a5d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2179,10 +2179,14 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
        kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
 }
 
-unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
-                                     unsigned long a0, unsigned long a1,
-                                     unsigned long a2, unsigned long a3,
-                                     int op_64_bit, int cpl);
+typedef void (*kvm_hypercall_set_ret_func_t)(struct kvm_vcpu *vcpu,
+               unsigned long val);
+
+int __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+                           unsigned long a0, unsigned long a1,
+                           unsigned long a2, unsigned long a3,
+                           int op_64_bit, int cpl,
+                           kvm_hypercall_set_ret_func_t set_ret_func);
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..01bdf01cfc79 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9998,10 +9998,11 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
        return kvm_skip_emulated_instruction(vcpu);
 }
 
-unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
-                                     unsigned long a0, unsigned long a1,
-                                     unsigned long a2, unsigned long a3,
-                                     int op_64_bit, int cpl)
+int __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+                           unsigned long a0, unsigned long a1,
+                           unsigned long a2, unsigned long a3,
+                           int op_64_bit, int cpl,
+                           kvm_hypercall_set_ret_func_t set_ret_func)
 {
        unsigned long ret;
 
@@ -10076,6 +10077,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 
                WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
                vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+               /* MAP_GPA tosses the request to the user space. */
                /* stat is incremented on completion. */
                return 0;
        }
@@ -10085,16 +10087,26 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
        }
 
 out:
+       if (!op_64_bit)
+               ret = (u32)ret;
+       (*set_ret_func)(vcpu, ret);
+
        ++vcpu->stat.hypercalls;
-       return ret;
+       return 1;
 }
 EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
 
+static void kvm_hypercall_set_ret_default(struct kvm_vcpu *vcpu,
+                                         unsigned long val)
+{
+       kvm_rax_write(vcpu, val);
+}
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
-       unsigned long nr, a0, a1, a2, a3, ret;
+       unsigned long nr, a0, a1, a2, a3;
        int op_64_bit;
-       int cpl;
+       int cpl, ret;
 
        if (kvm_xen_hypercall_enabled(vcpu->kvm))
                return kvm_xen_hypercall(vcpu);
@@ -10110,14 +10122,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
        op_64_bit = is_64_bit_hypercall(vcpu);
        cpl = kvm_x86_call(get_cpl)(vcpu);
 
-       ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
-       if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
-               /* MAP_GPA tosses the request to the user space. */
-               return 0;
-
-       if (!op_64_bit)
-               ret = (u32)ret;
-       kvm_rax_write(vcpu, ret);
+       ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl,
+                       kvm_hypercall_set_ret_default);
+       if (ret <= 0)
+               return ret;
 
        return kvm_skip_emulated_instruction(vcpu);
 }
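
With this shape, the TDX side would only need to supply its own setter.  A
hypothetical example (tdx_hypercall_set_ret() is not from any posted patch),
given that TDX carries the hypercall return value in R10:

static void tdx_hypercall_set_ret(struct kvm_vcpu *vcpu, unsigned long val)
{
	/* Matches kvm_hypercall_set_ret_func_t; TDX returns values in R10. */
	kvm_r10_write(vcpu, val);
}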
Sean Christopherson Nov. 1, 2024, 3:26 p.m. UTC | #9
On Fri, Nov 01, 2024, Binbin Wu wrote:
> On 10/31/2024 10:54 PM, Sean Christopherson wrote:
> > My other idea was to have an out-param to separate the return code intended for KVM
> > from the return code intended for the guest.  I generally dislike out-params, but
> > trying to juggle a return value that multiplexes guest and host values seems like
> > an even worse idea.
> > 
> > Also completely untested...

...

> >   	case KVM_HC_MAP_GPA_RANGE: {
> >   		u64 gpa = a0, npages = a1, attrs = a2;
> > -		ret = -KVM_ENOSYS;
> > +		*ret = -KVM_ENOSYS;
> >   		if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE))
> >   			break;
> >   		if (!PAGE_ALIGNED(gpa) || !npages ||
> >   		    gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
> > -			ret = -KVM_EINVAL;
> > +			*ret = -KVM_EINVAL;
> >   			break;
> >   		}
> 
> *ret needs to be set to 0 for this case before returning 0 to the caller?

No, because the caller should consume *ret if and only if the function return value
is '1', i.e. iff KVM should resume the guest.  And I think we actually want to
intentionally not touch *ret, because a sufficiently smart compiler (or static
analysis tool) should be able to detect that incorrect usage of *ret is consuming
uninitialized data.

> > @@ -10080,13 +10078,13 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> >   		return 0;
> >   	}
> >   	default:
> > -		ret = -KVM_ENOSYS;
> > +		*ret = -KVM_ENOSYS;
> >   		break;
> >   	}
> >   out:
> >   	++vcpu->stat.hypercalls;
> > -	return ret;
> > +	return 1;
> >   }
> >   EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> > @@ -10094,7 +10092,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >   {
> >   	unsigned long nr, a0, a1, a2, a3, ret;
> >   	int op_64_bit;
> > -	int cpl;
> > +	int cpl, r;
> >   	if (kvm_xen_hypercall_enabled(vcpu->kvm))
> >   		return kvm_xen_hypercall(vcpu);
> > @@ -10110,10 +10108,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >   	op_64_bit = is_64_bit_hypercall(vcpu);
> >   	cpl = kvm_x86_call(get_cpl)(vcpu);
> > -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > -		/* MAP_GPA tosses the request to the user space. */
> > -		return 0;
> > +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
> > +	if (r <= r)
> A typo here.
> I guess it meant to be "if (r <= ret)" ?

No, "if (r <= 0)", i.e. exit to userspace on 0 or -errno.

> So the combinations will be
> ----------------------------------------------------------------------------
>    |  r  |    ret    | r <= ret |
> ---|-----|-----------|----------|-------------------------------------------
>  1 |  0  |     0     |   true   |  return r, which is 0, exit to userspace
> ---|-----|-----------|----------|-------------------------------------------
>  2 |  1  |     0     |   false  |  set vcpu's RAX and return back to guest
> ---|-----|-----------|----------|-------------------------------------------
>  3 |  1  | -KVM_Exxx |   false  |  set vcpu's RAX and return back to guest
> ---|-----|-----------|----------|-------------------------------------------
>  4 |  1  |  Positive |   true   |  return r, which is 1,
>    |     |     N     |          |  back to guest without setting vcpu's RAX
> ----------------------------------------------------------------------------
> 
> KVM_HC_SEND_IPI, which calls kvm_pv_send_ipi(), can hit case 4, which will
> go back to the guest without setting RAX.  That differs from the current behavior.
> 
> r can be 0 only if there is no other error detected during pre-checks.
> I think it can just check whether r is 0 or not.

Yeah, I just fat fingered the code (and didn't even compile test).
Sean Christopherson Nov. 1, 2024, 4:39 p.m. UTC | #10
On Fri, Nov 01, 2024, Kai Huang wrote:
> On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:
> > On Thu, Oct 31, 2024, Kai Huang wrote:
> > -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > -		/* MAP_GPA tosses the request to the user space. */
> > -		return 0;
> > +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
> > +	if (r <= r)
> > +		return r;
> 
> ... should be:
> 
> 	if (r <= 0)
> 		return r;
> 
> ?
> 
> Another option might be to move the "set hypercall return value" code inside
> __kvm_emulate_hypercall().  IIUC the reason to split
> __kvm_emulate_hypercall() out is for TDX, and while non-TDX uses RAX to carry
> the hypercall return value, TDX uses R10.
> 
> We can additionally pass a "kvm_hypercall_set_ret_func" function pointer to
> __kvm_emulate_hypercall(), and invoke it inside.  Then we can change
> __kvm_emulate_hypercall() to return: 
>     < 0 error, 
>     ==0 return to userspace, 
>     > 0 go back to guest.

Hmm, and the caller can still handle kvm_skip_emulated_instruction(), because the
return value is KVM's normal pattern.

I like it!

But, there's no need to pass a function pointer, KVM can write (and read) arbitrary
GPRs, it's just avoided in most cases so that the sanity checks and available/dirty
updates are elided.  For this code though, it's easy enough to keep kvm_rxx_read()
for getting values, and eating the overhead of a single GPR write is a perfectly
fine tradeoff for eliminating the return multiplexing.

Lightly tested.  Assuming this works for TDX and passes testing, I'll post a
mini-series next week.

--
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 1 Nov 2024 09:04:00 -0700
Subject: [PATCH] KVM: x86: Refactor __kvm_emulate_hypercall() to accept reg
 names, not values

Rework __kvm_emulate_hypercall() to take the names of input and output
(guest return value) registers, as opposed to taking the input values and
returning the output value.  As part of the refactor, change the actual
return value from __kvm_emulate_hypercall() to be KVM's de facto standard
of '0' == exit to userspace, '1' == resume guest, and -errno == failure.

Using the return value for KVM's control flow eliminates the multiplexed
return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall)
means "exit to userspace".

Use the direct GPR accessors to read values to avoid the pointless marking
of the registers as available, but use kvm_register_write_raw() for the
guest return value so that the innermost helper doesn't need to multiplex
its return value.  Using the generic kvm_register_write_raw() adds very
minimal overhead, so as a one-off in a relatively slow path it's well
worth the code simplification.

Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 15 +++++++++----
 arch/x86/kvm/x86.c              | 40 +++++++++++++--------------------
 2 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..9e66fde1c4e4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2179,10 +2179,17 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
 	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
 }
 
-unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
-				      unsigned long a0, unsigned long a1,
-				      unsigned long a2, unsigned long a3,
-				      int op_64_bit, int cpl);
+int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+			      unsigned long a0, unsigned long a1,
+			      unsigned long a2, unsigned long a3,
+			      int op_64_bit, int cpl, int ret_reg);
+
+#define __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, ret)	\
+	____kvm_emulate_hypercall(vcpu,						\
+				  kvm_##nr##_read(vcpu), kvm_##a0##_read(vcpu),	\
+				  kvm_##a1##_read(vcpu), kvm_##a2##_read(vcpu),	\
+				  kvm_##a3##_read(vcpu), op_64_bit, cpl, VCPU_REGS_##ret)
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e09daa3b157c..425a301911a6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9998,10 +9998,10 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
-unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
-				      unsigned long a0, unsigned long a1,
-				      unsigned long a2, unsigned long a3,
-				      int op_64_bit, int cpl)
+int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+			      unsigned long a0, unsigned long a1,
+			      unsigned long a2, unsigned long a3,
+			      int op_64_bit, int cpl, int ret_reg)
 {
 	unsigned long ret;
 
@@ -10086,15 +10086,18 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 
 out:
 	++vcpu->stat.hypercalls;
-	return ret;
+
+	if (!op_64_bit)
+		ret = (u32)ret;
+
+	kvm_register_write_raw(vcpu, ret_reg, ret);
+	return 1;
 }
-EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+EXPORT_SYMBOL_GPL(____kvm_emulate_hypercall);
 
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
-	unsigned long nr, a0, a1, a2, a3, ret;
-	int op_64_bit;
-	int cpl;
+	int r;
 
 	if (kvm_xen_hypercall_enabled(vcpu->kvm))
 		return kvm_xen_hypercall(vcpu);
@@ -10102,23 +10105,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	if (kvm_hv_hypercall_enabled(vcpu))
 		return kvm_hv_hypercall(vcpu);
 
-	nr = kvm_rax_read(vcpu);
-	a0 = kvm_rbx_read(vcpu);
-	a1 = kvm_rcx_read(vcpu);
-	a2 = kvm_rdx_read(vcpu);
-	a3 = kvm_rsi_read(vcpu);
-	op_64_bit = is_64_bit_hypercall(vcpu);
-	cpl = kvm_x86_call(get_cpl)(vcpu);
-
-	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
-	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
-		/* MAP_GPA tosses the request to the user space. */
+	r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
+				    is_64_bit_hypercall(vcpu),
+				    kvm_x86_call(get_cpl)(vcpu), RAX);
+	if (r <= 0)
 		return 0;
 
-	if (!op_64_bit)
-		ret = (u32)ret;
-	kvm_rax_write(vcpu, ret);
-
 	return kvm_skip_emulated_instruction(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);

base-commit: 911785b796e325dec83b32050f294e278a306211
--
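
For readers tracing the token pasting: the __kvm_emulate_hypercall() invocation
above expands to roughly the following (hand-expanded; not part of the patch):

	r = ____kvm_emulate_hypercall(vcpu,
				      kvm_rax_read(vcpu), kvm_rbx_read(vcpu),
				      kvm_rcx_read(vcpu), kvm_rdx_read(vcpu),
				      kvm_rsi_read(vcpu),
				      is_64_bit_hypercall(vcpu),
				      kvm_x86_call(get_cpl)(vcpu), VCPU_REGS_RAX);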
Kai Huang Nov. 1, 2024, 9:13 p.m. UTC | #11
On Fri, 2024-11-01 at 09:39 -0700, Sean Christopherson wrote:
> On Fri, Nov 01, 2024, Kai Huang wrote:
> > On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:
> > > On Thu, Oct 31, 2024, Kai Huang wrote:
> > > -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > > -		/* MAP_GPA tosses the request to the user space. */
> > > -		return 0;
> > > +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
> > > +	if (r <= r)
> > > +		return r;
> > 
> > ... should be:
> > 
> > 	if (r <= 0)
> > 		return r;
> > 
> > ?
> > 
> > Another option might be to move the "set hypercall return value" code inside
> > __kvm_emulate_hypercall().  IIUC the reason to split
> > __kvm_emulate_hypercall() out is for TDX, and while non-TDX uses RAX to carry
> > the hypercall return value, TDX uses R10.
> > 
> > We can additionally pass a "kvm_hypercall_set_ret_func" function pointer to
> > __kvm_emulate_hypercall(), and invoke it inside.  Then we can change
> > __kvm_emulate_hypercall() to return: 
> >     < 0 error, 
> >     ==0 return to userspace, 
> >     > 0 go back to guest.
> 
> Hmm, and the caller can still handle kvm_skip_emulated_instruction(), because the
> return value is KVM's normal pattern.
> 
> I like it!
> 
> But, there's no need to pass a function pointer, KVM can write (and read) arbitrary
> GPRs, it's just avoided in most cases so that the sanity checks and available/dirty
> updates are elided.  For this code though, it's easy enough to keep kvm_rxx_read()
> for getting values, and eating the overhead of a single GPR write is a perfectly
> fine tradeoff for eliminating the return multiplexing.
> 
> Lightly tested.  Assuming this works for TDX and passes testing, I'll post a
> mini-series next week.
> 
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 1 Nov 2024 09:04:00 -0700
> Subject: [PATCH] KVM: x86: Refactor __kvm_emulate_hypercall() to accept reg
>  names, not values
> 
> Rework __kvm_emulate_hypercall() to take the names of input and output
> (guest return value) registers, as opposed to taking the input values and
> returning the output value.  As part of the refactor, change the actual
> return value from __kvm_emulate_hypercall() to be KVM's de facto standard
> of '0' == exit to userspace, '1' == resume guest, and -errno == failure.
> 
> Using the return value for KVM's control flow eliminates the multiplexed
> return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall)
> means "exit to userspace".
> 
> Use the direct GPR accessors to read values to avoid the pointless marking
> of the registers as available, but use kvm_register_write_raw() for the
> guest return value so that the innermost helper doesn't need to multiplex
> its return value.  Using the generic kvm_register_write_raw() adds very
> minimal overhead, so as a one-off in a relatively slow path it's well
> worth the code simplification.

Ah right :-)

> 
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

I think Binbin can help to test on TDX, and assuming it works,

Reviewed-by: Kai Huang <kai.huang@intel.com>
Binbin Wu Nov. 4, 2024, 8:49 a.m. UTC | #12
On 11/2/2024 12:39 AM, Sean Christopherson wrote:
> On Fri, Nov 01, 2024, Kai Huang wrote:
>> On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:
>>> On Thu, Oct 31, 2024, Kai Huang wrote:
>>> -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
>>> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
>>> -		/* MAP_GPA tosses the request to the user space. */
>>> -		return 0;
>>> +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
>>> +	if (r <= r)
>>> +		return r;
>> ... should be:
>>
>> 	if (r <= 0)
>> 		return r;
>>
>> ?
>>
>> Another option might be to move the "set hypercall return value" code inside
>> __kvm_emulate_hypercall().  IIUC the reason to split
>> __kvm_emulate_hypercall() out is for TDX, and while non-TDX uses RAX to carry
>> the hypercall return value, TDX uses R10.
>>
>> We can additionally pass a "kvm_hypercall_set_ret_func" function pointer to
>> __kvm_emulate_hypercall(), and invoke it inside.  Then we can change
>> __kvm_emulate_hypercall() to return:
>>      < 0 error,
>>      ==0 return to userspace,
>>      > 0 go back to guest.
> Hmm, and the caller can still handle kvm_skip_emulated_instruction(), because the
> return value is KVM's normal pattern.
>
> I like it!
>
> But, there's no need to pass a function pointer, KVM can write (and read) arbitrary
> GPRs, it's just avoided in most cases so that the sanity checks and available/dirty
> updates are elided.  For this code though, it's easy enough to keep kvm_rxx_read()
> for getting values, and eating the overhead of a single GPR write is a perfectly
> fine tradeoff for eliminating the return multiplexing.
>
> Lightly tested.  Assuming this works for TDX and passes testing, I'll post a
> mini-series next week.
>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 1 Nov 2024 09:04:00 -0700
> Subject: [PATCH] KVM: x86: Refactor __kvm_emulate_hypercall() to accept reg
>   names, not values
>
> Rework __kvm_emulate_hypercall() to take the names of input and output
> (guest return value) registers, as opposed to taking the input values and
> returning the output value.  As part of the refactor, change the actual
> return value from __kvm_emulate_hypercall() to be KVM's de facto standard
> of '0' == exit to userspace, '1' == resume guest, and -errno == failure.
>
> Using the return value for KVM's control flow eliminates the multiplexed
> return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall)
> means "exit to userspace".
>
> Use the direct GPR accessors to read values to avoid the pointless marking
> of the registers as available, but use kvm_register_write_raw() for the
> guest return value so that the innermost helper doesn't need to multiplex
> its return value.  Using the generic kvm_register_write_raw() adds very
> minimal overhead, so as a one-off in a relatively slow path it's well
> worth the code simplification.
>
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/include/asm/kvm_host.h | 15 +++++++++----
>   arch/x86/kvm/x86.c              | 40 +++++++++++++--------------------
>   2 files changed, 27 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6d9f763a7bb9..9e66fde1c4e4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2179,10 +2179,17 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>   	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>   }
>   
> -unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> -				      unsigned long a0, unsigned long a1,
> -				      unsigned long a2, unsigned long a3,
> -				      int op_64_bit, int cpl);
> +int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> +			      unsigned long a0, unsigned long a1,
> +			      unsigned long a2, unsigned long a3,
> +			      int op_64_bit, int cpl, int ret_reg);
> +
> +#define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, ret)	\
> +	____kvm_emulate_hypercall(vcpu,						\
> +				  kvm_##nr##_read(vcpu), kvm_##a0##_read(vcpu),	\
> +				  kvm_##a1##_read(vcpu), kvm_##a2##_read(vcpu),	\
> +				  kvm_##a3##_read(vcpu), op_64_bit, cpl, VCPU_REGS_##ret)
> +
>   int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>   
>   int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e09daa3b157c..425a301911a6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9998,10 +9998,10 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>   	return kvm_skip_emulated_instruction(vcpu);
>   }
>   
> -unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> -				      unsigned long a0, unsigned long a1,
> -				      unsigned long a2, unsigned long a3,
> -				      int op_64_bit, int cpl)
> +int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> +			      unsigned long a0, unsigned long a1,
> +			      unsigned long a2, unsigned long a3,
> +			      int op_64_bit, int cpl, int ret_reg)
>   {
>   	unsigned long ret;
>   
> @@ -10086,15 +10086,18 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   
>   out:
>   	++vcpu->stat.hypercalls;
> -	return ret;
> +
> +	if (!op_64_bit)
> +		ret = (u32)ret;
> +
> +	kvm_register_write_raw(vcpu, ret_reg, ret);
> +	return 1;
>   }
> -EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> +EXPORT_SYMBOL_GPL(____kvm_emulate_hypercall);
>   
>   int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   {
> -	unsigned long nr, a0, a1, a2, a3, ret;
> -	int op_64_bit;
> -	int cpl;
> +	int r;
>   
>   	if (kvm_xen_hypercall_enabled(vcpu->kvm))
>   		return kvm_xen_hypercall(vcpu);
> @@ -10102,23 +10105,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   	if (kvm_hv_hypercall_enabled(vcpu))
>   		return kvm_hv_hypercall(vcpu);
>   
> -	nr = kvm_rax_read(vcpu);
> -	a0 = kvm_rbx_read(vcpu);
> -	a1 = kvm_rcx_read(vcpu);
> -	a2 = kvm_rdx_read(vcpu);
> -	a3 = kvm_rsi_read(vcpu);
> -	op_64_bit = is_64_bit_hypercall(vcpu);
> -	cpl = kvm_x86_call(get_cpl)(vcpu);
> -
> -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> -		/* MAP_GPA tosses the request to the user space. */
> +	r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
> +				    is_64_bit_hypercall(vcpu),
> +				    kvm_x86_call(get_cpl)(vcpu), RAX);
Now, the register for the hypercall's return code can be specified.
But in ____kvm_emulate_hypercall(), the complete_userspace_io callback
is hardcoded to complete_hypercall_exit(), which always sets the return
code in RAX.

We can allow the caller to pass in the cui callback, or assign a different
version according to the input 'ret_reg', so that different callers can use
different cui callbacks.  E.g., TDX needs to set the return code in R10 in
its cui callback.

How about:

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dba78f22ab27..0fba98685f42 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2226,13 +2226,15 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
  int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
                               unsigned long a0, unsigned long a1,
                               unsigned long a2, unsigned long a3,
-                             int op_64_bit, int cpl, int ret_reg);
+                             int op_64_bit, int cpl, int ret_reg,
+                             int (*cui)(struct kvm_vcpu *vcpu));

-#define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, ret)        \
+#define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, ret, cui)   \
 	____kvm_emulate_hypercall(vcpu, \
                                   kvm_##nr##_read(vcpu), kvm_##a0##_read(vcpu), \
                                   kvm_##a1##_read(vcpu), kvm_##a2##_read(vcpu), \
-                                 kvm_##a3##_read(vcpu), op_64_bit, cpl, VCPU_REGS_##ret)
+                                 kvm_##a3##_read(vcpu), op_64_bit, cpl, VCPU_REGS_##ret, \
+                                 cui)

  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e0a518aec4a..b68690c4a4c0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10019,7 +10019,8 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
  int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
                               unsigned long a0, unsigned long a1,
                               unsigned long a2, unsigned long a3,
-                             int op_64_bit, int cpl, int ret_reg)
+                             int op_64_bit, int cpl, int ret_reg,
+                             int (*cui)(struct kvm_vcpu *vcpu))
  {
         unsigned long ret;

@@ -10093,7 +10094,7 @@ int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
                         vcpu->run->hypercall.flags |= KVM_EXIT_HYPERCALL_LONG_MODE;

                 WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
-               vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+               vcpu->arch.complete_userspace_io = cui;
                 /* stat is incremented on completion. */
                 return 0;
         }
@@ -10125,7 +10126,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)

         r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
                                     is_64_bit_hypercall(vcpu),
-                                   kvm_x86_call(get_cpl)(vcpu), RAX);
+                                   kvm_x86_call(get_cpl)(vcpu), RAX, complete_hypercall_exit);
         if (r <= 0)
                 return 0;


> +	if (r <= 0)
>   		return 0;
>   
> -	if (!op_64_bit)
> -		ret = (u32)ret;
> -	kvm_rax_write(vcpu, ret);
> -
>   	return kvm_skip_emulated_instruction(vcpu);
>   }
>   EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
>
> base-commit: 911785b796e325dec83b32050f294e278a306211
Binbin Wu Nov. 4, 2024, 9:03 a.m. UTC | #13
On 11/2/2024 5:13 AM, Huang, Kai wrote:
> On Fri, 2024-11-01 at 09:39 -0700, Sean Christopherson wrote:
>> On Fri, Nov 01, 2024, Kai Huang wrote:
>>> On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:
>>>> On Thu, Oct 31, 2024, Kai Huang wrote:
>>>> -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
>>>> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
>>>> -		/* MAP_GPA tosses the request to the user space. */
>>>> -		return 0;
>>>> +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
>>>> +	if (r <= r)
>>>> +		return r;
>>> ... should be:
>>>
>>> 	if (r <= 0)
>>> 		return r;
>>>
>>> ?
>>>
>>> Another option might be we move "set hypercall return value" code inside
>>> __kvm_emulate_hypercall().  So IIUC the reason to split
>>> __kvm_emulate_hypercall() out is for TDX, and while non-TDX uses RAX to carry
>>> the hypercall return value, TDX uses R10.
>>>
>>> We can additionally pass a "kvm_hypercall_set_ret_func" function pointer to
>>> __kvm_emulate_hypercall(), and invoke it inside.  Then we can change
>>> __kvm_emulate_hypercall() to return:
>>>      < 0 error,
>>>      ==0 return to userspace,
>>>      > 0 go back to guest.
>> Hmm, and the caller can still handle kvm_skip_emulated_instruction(), because the
>> return value is KVM's normal pattern.
>>
>> I like it!
>>
>> But, there's no need to pass a function pointer, KVM can write (and read) arbitrary
>> GPRs, it's just avoided in most cases so that the sanity checks and available/dirty
>> updates are elided.  For this code though, it's easy enough to keep kvm_rxx_read()
>> for getting values, and eating the overhead of a single GPR write is a perfectly
>> fine tradeoff for eliminating the return multiplexing.
>>
>> Lightly tested.  Assuming this works for TDX and passes testing, I'll post a
>> mini-series next week.
>>
>> --
>> From: Sean Christopherson <seanjc@google.com>
>> Date: Fri, 1 Nov 2024 09:04:00 -0700
>> Subject: [PATCH] KVM: x86: Refactor __kvm_emulate_hypercall() to accept reg
>>   names, not values
>>
>> Rework __kvm_emulate_hypercall() to take the names of input and output
>> (guest return value) registers, as opposed to taking the input values and
>> returning the output value.  As part of the refactor, change the actual
>> return value from __kvm_emulate_hypercall() to be KVM's de facto standard
>> of '0' == exit to userspace, '1' == resume guest, and -errno == failure.
>>
>> Using the return value for KVM's control flow eliminates the multiplexed
>> return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall)
>> means "exit to userspace".
>>
>> Use the direct GPR accessors to read values to avoid the pointless marking
>> of the registers as available, but use kvm_register_write_raw() for the
>> guest return value so that the innermost helper doesn't need to multiplex
>> its return value.  Using the generic kvm_register_write_raw() adds very
>> minimal overhead, so as a one-off in a relatively slow path it's well
>> worth the code simplification.
> Ah right :-)
>
>> Suggested-by: Kai Huang <kai.huang@intel.com>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> ---
> I think Binbin can help to test on TDX, and assuming it works,
I tried to add a selftest case for TDX that does memory conversion directly
via the KVM hypercall, and found that the TDX code didn't handle the
hypercall's return value properly.
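
The guest side of the selftest is roughly like below (kvm_hypercall() is the
existing selftests helper; the exact arguments are from memory, so treat this
as a sketch):

	ret = kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, npages,
			    KVM_MAP_GPA_RANGE_ENCRYPTED, 0);
	GUEST_ASSERT(!ret);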

I added a parameter to pass the cui callback, as mentioned in
https://lore.kernel.org/lkml/f95cd8c6-af5c-4d8f-99a8-16d0ec56d9a4@linux.intel.com/
and then made the following change in the TDX code to make it work.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cd27ebd3d7d1..efa434c6547d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1072,6 +1072,15 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
         return 0;
  }

+static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
+{
+       u64 ret = vcpu->run->hypercall.ret;
+
+       kvm_r10_write(vcpu, ret);
+       ++vcpu->stat.hypercalls;
+
+       return 1;
+}
+
  static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
  {
         int r;
@@ -1087,7 +1096,7 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
          * R10: KVM hypercall number
          * arguments: R11, R12, R13, R14.
          */
-       r = __kvm_emulate_hypercall(vcpu, r10, r11, r12, r13, r14, true, 0, R10);
+       r = __kvm_emulate_hypercall(vcpu, r10, r11, r12, r13, r14, true, 0, R10, complete_hypercall_exit);

         return r > 0;
  }


>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
>
Kai Huang Nov. 4, 2024, 9:55 a.m. UTC | #14
On Mon, 2024-11-04 at 16:49 +0800, Binbin Wu wrote:
> 
> 
> On 11/2/2024 12:39 AM, Sean Christopherson wrote:
> > On Fri, Nov 01, 2024, Kai Huang wrote:
> > > On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:
> > > > On Thu, Oct 31, 2024, Kai Huang wrote:
> > > > -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > > > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > > > -		/* MAP_GPA tosses the request to the user space. */
> > > > -		return 0;
> > > > +	r = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, &ret);
> > > > +	if (r <= r)
> > > > +		return r;
> > > ... should be:
> > > 
> > > 	if (r <= 0)
> > > 		return r;
> > > 
> > > ?
> > > 
> > > Another option might be we move "set hypercall return value" code inside
> > > __kvm_emulate_hypercall().  So IIUC the reason to split
> > > __kvm_emulate_hypercall() out is for TDX, and while non-TDX uses RAX to carry
> > > the hypercall return value, TDX uses R10.
> > > 
> > > We can additionally pass a "kvm_hypercall_set_ret_func" function pointer to
> > > __kvm_emulate_hypercall(), and invoke it inside.  Then we can change
> > > __kvm_emulate_hypercall() to return:
> > >      < 0 error,
> > >      ==0 return to userspace,
> > >      > 0 go back to guest.
> > Hmm, and the caller can still handle kvm_skip_emulated_instruction(), because the
> > return value is KVM's normal pattern.
> > 
> > I like it!
> > 
> > But, there's no need to pass a function pointer, KVM can write (and read) arbitrary
> > GPRs, it's just avoided in most cases so that the sanity checks and available/dirty
> > updates are elided.  For this code though, it's easy enough to keep kvm_rxx_read()
> > for getting values, and eating the overhead of a single GPR write is a perfectly
> > fine tradeoff for eliminating the return multiplexing.
> > 
> > Lightly tested.  Assuming this works for TDX and passes testing, I'll post a
> > mini-series next week.
> > 
> > --
> > From: Sean Christopherson <seanjc@google.com>
> > Date: Fri, 1 Nov 2024 09:04:00 -0700
> > Subject: [PATCH] KVM: x86: Refactor __kvm_emulate_hypercall() to accept reg
> >   names, not values
> > 
> > Rework __kvm_emulate_hypercall() to take the names of input and output
> > (guest return value) registers, as opposed to taking the input values and
> > returning the output value.  As part of the refactor, change the actual
> > return value from __kvm_emulate_hypercall() to be KVM's de facto standard
> > of '0' == exit to userspace, '1' == resume guest, and -errno == failure.
> > 
> > Using the return value for KVM's control flow eliminates the multiplexed
> > return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall)
> > means "exit to userspace".
> > 
> > Use the direct GPR accessors to read values to avoid the pointless marking
> > of the registers as available, but use kvm_register_write_raw() for the
> > guest return value so that the innermost helper doesn't need to multiplex
> > its return value.  Using the generic kvm_register_write_raw() adds very
> > minimal overhead, so as a one-off in a relatively slow path it's well
> > worth the code simplification.
> > 
> > Suggested-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/include/asm/kvm_host.h | 15 +++++++++----
> >   arch/x86/kvm/x86.c              | 40 +++++++++++++--------------------
> >   2 files changed, 27 insertions(+), 28 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6d9f763a7bb9..9e66fde1c4e4 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2179,10 +2179,17 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
> >   	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
> >   }
> >   
> > -unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> > -				      unsigned long a0, unsigned long a1,
> > -				      unsigned long a2, unsigned long a3,
> > -				      int op_64_bit, int cpl);
> > +int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> > +			      unsigned long a0, unsigned long a1,
> > +			      unsigned long a2, unsigned long a3,
> > +			      int op_64_bit, int cpl, int ret_reg);
> > +
> > +#define __kvm_emulate_hypercall(_vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl, ret)	\
> > +	____kvm_emulate_hypercall(vcpu,						\
> > +				  kvm_##nr##_read(vcpu), kvm_##a0##_read(vcpu),	\
> > +				  kvm_##a1##_read(vcpu), kvm_##a2##_read(vcpu),	\
> > +				  kvm_##a3##_read(vcpu), op_64_bit, cpl, VCPU_REGS_##ret)
> > +
> >   int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
> >   
> >   int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e09daa3b157c..425a301911a6 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9998,10 +9998,10 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> >   	return kvm_skip_emulated_instruction(vcpu);
> >   }
> >   
> > -unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> > -				      unsigned long a0, unsigned long a1,
> > -				      unsigned long a2, unsigned long a3,
> > -				      int op_64_bit, int cpl)
> > +int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> > +			      unsigned long a0, unsigned long a1,
> > +			      unsigned long a2, unsigned long a3,
> > +			      int op_64_bit, int cpl, int ret_reg)
> >   {
> >   	unsigned long ret;
> >   
> > @@ -10086,15 +10086,18 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> >   
> >   out:
> >   	++vcpu->stat.hypercalls;
> > -	return ret;
> > +
> > +	if (!op_64_bit)
> > +		ret = (u32)ret;
> > +
> > +	kvm_register_write_raw(vcpu, ret_reg, ret);
> > +	return 1;
> >   }
> > -EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> > +EXPORT_SYMBOL_GPL(____kvm_emulate_hypercall);
> >   
> >   int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >   {
> > -	unsigned long nr, a0, a1, a2, a3, ret;
> > -	int op_64_bit;
> > -	int cpl;
> > +	int r;
> >   
> >   	if (kvm_xen_hypercall_enabled(vcpu->kvm))
> >   		return kvm_xen_hypercall(vcpu);
> > @@ -10102,23 +10105,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >   	if (kvm_hv_hypercall_enabled(vcpu))
> >   		return kvm_hv_hypercall(vcpu);
> >   
> > -	nr = kvm_rax_read(vcpu);
> > -	a0 = kvm_rbx_read(vcpu);
> > -	a1 = kvm_rcx_read(vcpu);
> > -	a2 = kvm_rdx_read(vcpu);
> > -	a3 = kvm_rsi_read(vcpu);
> > -	op_64_bit = is_64_bit_hypercall(vcpu);
> > -	cpl = kvm_x86_call(get_cpl)(vcpu);
> > -
> > -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > -		/* MAP_GPA tosses the request to the user space. */
> > +	r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
> > +				    is_64_bit_hypercall(vcpu),
> > +				    kvm_x86_call(get_cpl)(vcpu), RAX);
> Now, the register for the hypercall's return code can be specified.
> But in ____kvm_emulate_hypercall(), the complete_userspace_io callback
> is hardcoded to complete_hypercall_exit(), which always sets the return
> code in RAX.
> 
> We can allow the caller to pass in the cui callback, or assign a different
> version according to the input 'ret_reg', so that different callers can use
> different cui callbacks.  E.g., TDX needs to set the return code in R10 in
> its cui callback.
> 
> How about:
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dba78f22ab27..0fba98685f42 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2226,13 +2226,15 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>   int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>                                unsigned long a0, unsigned long a1,
>                                unsigned long a2, unsigned long a3,
> -                             int op_64_bit, int cpl, int ret_reg);
> +                             int op_64_bit, int cpl, int ret_reg,
> +                             int (*cui)(struct kvm_vcpu *vcpu));
> 

Does below (incremental diff based on Sean's) work?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 734dac079453..5131af97968d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10075,7 +10075,6 @@ int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
                        vcpu->run->hypercall.flags |= KVM_EXIT_HYPERCALL_LONG_MODE;
 
                WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
-               vcpu->arch.complete_userspace_io = complete_hypercall_exit;
                /* stat is incremented on completion. */
                return 0;
        }
@@ -10108,8 +10107,11 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
        r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
                                    is_64_bit_hypercall(vcpu),
                                    kvm_x86_call(get_cpl)(vcpu), RAX);
-       if (r <= 0)
+       if (r <= 0) {
+               if (!r)
+                       vcpu->arch.complete_userspace_io = complete_hypercall_exit;
                return 0;
+       }
 
        return kvm_skip_emulated_instruction(vcpu);
 }
Binbin Wu Nov. 5, 2024, 1:07 a.m. UTC | #15
On 11/4/2024 5:55 PM, Huang, Kai wrote:
[...]
>>>    int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>>    {
>>> -	unsigned long nr, a0, a1, a2, a3, ret;
>>> -	int op_64_bit;
>>> -	int cpl;
>>> +	int r;
>>>    
>>>    	if (kvm_xen_hypercall_enabled(vcpu->kvm))
>>>    		return kvm_xen_hypercall(vcpu);
>>> @@ -10102,23 +10105,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>>>    	if (kvm_hv_hypercall_enabled(vcpu))
>>>    		return kvm_hv_hypercall(vcpu);
>>>    
>>> -	nr = kvm_rax_read(vcpu);
>>> -	a0 = kvm_rbx_read(vcpu);
>>> -	a1 = kvm_rcx_read(vcpu);
>>> -	a2 = kvm_rdx_read(vcpu);
>>> -	a3 = kvm_rsi_read(vcpu);
>>> -	op_64_bit = is_64_bit_hypercall(vcpu);
>>> -	cpl = kvm_x86_call(get_cpl)(vcpu);
>>> -
>>> -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
>>> -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
>>> -		/* MAP_GPA tosses the request to the user space. */
>>> +	r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
>>> +				    is_64_bit_hypercall(vcpu),
>>> +				    kvm_x86_call(get_cpl)(vcpu), RAX);
>> Now, the register for the hypercall's return code can be specified.
>> But in ____kvm_emulate_hypercall(), the complete_userspace_io callback
>> is hardcoded to complete_hypercall_exit(), which always sets the return
>> code in RAX.
>>
>> We can allow the caller to pass in the cui callback, or assign a different
>> version according to the input 'ret_reg', so that different callers can use
>> different cui callbacks.  E.g., TDX needs to set the return code in R10 in
>> its cui callback.
>>
>> How about:
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index dba78f22ab27..0fba98685f42 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -2226,13 +2226,15 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>>    int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>>                                 unsigned long a0, unsigned long a1,
>>                                 unsigned long a2, unsigned long a3,
>> -                             int op_64_bit, int cpl, int ret_reg);
>> +                             int op_64_bit, int cpl, int ret_reg,
>> +                             int (*cui)(struct kvm_vcpu *vcpu));
>>
> Does below (incremental diff based on Sean's) work?

Yes, it can work.

>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 734dac079453..5131af97968d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10075,7 +10075,6 @@ int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>                          vcpu->run->hypercall.flags |= KVM_EXIT_HYPERCALL_LONG_MODE;
>  
>                  WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
> -               vcpu->arch.complete_userspace_io = complete_hypercall_exit;
>                  /* stat is incremented on completion. */
>                  return 0;
>          }
> @@ -10108,8 +10107,11 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>          r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
>                                      is_64_bit_hypercall(vcpu),
>                                      kvm_x86_call(get_cpl)(vcpu), RAX);
> -       if (r <= 0)
> +       if (r <= 0) {
> +               if (!r)
> +                       vcpu->arch.complete_userspace_io = complete_hypercall_exit;
>                  return 0;
> +       }
>   
>          return kvm_skip_emulated_instruction(vcpu);
>   }
>
Sean Christopherson Nov. 5, 2024, 2:32 a.m. UTC | #16
On Mon, Nov 04, 2024, Kai Huang wrote:
> On Mon, 2024-11-04 at 16:49 +0800, Binbin Wu wrote:
> > On 11/2/2024 12:39 AM, Sean Christopherson wrote:
> > > On Fri, Nov 01, 2024, Kai Huang wrote:
> > > > On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:

...

> > > Lightly tested.  Assuming this works for TDX and passes testing, I'll post a
> > > mini-series next week.
> > > 
> > > --
> > > From: Sean Christopherson <seanjc@google.com>
> > > Date: Fri, 1 Nov 2024 09:04:00 -0700
> > > Subject: [PATCH] KVM: x86: Refactor __kvm_emulate_hypercall() to accept reg
> > >   names, not values
> > > 
> > > Rework __kvm_emulate_hypercall() to take the names of input and output
> > > (guest return value) registers, as opposed to taking the input values and
> > > returning the output value.  As part of the refactor, change the actual
> > > return value from __kvm_emulate_hypercall() to be KVM's de facto standard
> > > of '0' == exit to userspace, '1' == resume guest, and -errno == failure.
> > > 
> > > Using the return value for KVM's control flow eliminates the multiplexed
> > > return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall)
> > > means "exit to userspace".
> > > 
> > > Use the direct GPR accessors to read values to avoid the pointless marking
> > > of the registers as available, but use kvm_register_write_raw() for the
> > > guest return value so that the innermost helper doesn't need to multiplex
> > > its return value.  Using the generic kvm_register_write_raw() adds very
> > > minimal overhead, so as a one-off in a relatively slow path it's well
> > > worth the code simplification.
> > > 
> > > Suggested-by: Kai Huang <kai.huang@intel.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---

...

> > > -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > > -		/* MAP_GPA tosses the request to the user space. */
> > > +	r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
> > > +				    is_64_bit_hypercall(vcpu),
> > > +				    kvm_x86_call(get_cpl)(vcpu), RAX);
> > Now, the register for the hypercall's return code can be specified.
> > But in  ____kvm_emulate_hypercall(), the complete_userspace_io callback
> > is hardcoded to complete_hypercall_exit(), which always sets the return
> > code in RAX.
> > 
> > We can allow the caller to pass in the cui callback, or assign a different
> > version according to the input 'ret_reg', so that different callers can use
> > different cui callbacks.  E.g., TDX needs to set the return code in R10 in
> > its cui callback.
> > 
> > How about:
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index dba78f22ab27..0fba98685f42 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2226,13 +2226,15 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
> >   int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> >                                unsigned long a0, unsigned long a1,
> >                                unsigned long a2, unsigned long a3,
> > -                             int op_64_bit, int cpl, int ret_reg);
> > +                             int op_64_bit, int cpl, int ret_reg,
> > +                             int (*cui)(struct kvm_vcpu *vcpu));
> > 
> 
> Does below (incremental diff based on Sean's) work?
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 734dac079453..5131af97968d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10075,7 +10075,6 @@ int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>                         vcpu->run->hypercall.flags |= KVM_EXIT_HYPERCALL_LONG_MODE;
>  
>                 WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
> -               vcpu->arch.complete_userspace_io = complete_hypercall_exit;
>                 /* stat is incremented on completion. */
>                 return 0;
>         }
> @@ -10108,8 +10107,11 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>         r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
>                                     is_64_bit_hypercall(vcpu),
>                                     kvm_x86_call(get_cpl)(vcpu), RAX);
> -       if (r <= 0)
> +       if (r <= 0) {
> +               if (!r)
> +                       vcpu->arch.complete_userspace_io = complete_hypercall_exit;
>                 return 0;
> +       }

I think I prefer Binbin's version, as it forces the caller to provide cui(), i.e.
makes it harder for KVM to fail to handle the backend of the hypercall.

Side topic, unless I'm missing something, all of the instruction VM-Exits that are
piped through #VMGEXIT to svm_invoke_exit_handler() will make a bogus call to
kvm_skip_emulated_instruction().  It "works", but only because nothing ever looks
at the modified data.

At first, I thought it was just VMMCALL, and so was wondering if maybe we could
fix that wart too.  But it's *every* instruction, so it's probably out of scope.

The one thing I don't love about providing a separate cui() is that it means
duplicating the guts of the completion helper.  Ha!  But we can avoid that by
adding another macro (untested).

More macros/helpers is a bit ugly too, but I like the symmetry, and it will
definitely be easier to maintain.  E.g. if the completion phase needs to pivot
on the exact hypercall, then we can update common code and don't need to remember
to go update TDX too.

If no one objects and/or has a better idea, I'll splice together Binbin's patch
with this blob, and post a series tomorrow.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e8ca6dab2b2..0b0fa9174000 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2179,6 +2179,16 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
        kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
 }
 
+#define kvm_complete_hypercall_exit(vcpu, ret_reg)                             \
+do {                                                                           \
+       u64 ret = (vcpu)->run->hypercall.ret;                                   \
+                                                                               \
+       if (!is_64_bit_mode(vcpu))                                              \
+               ret = (u32)ret;                                                 \
+       kvm_##ret_reg##_write(vcpu, ret);                                       \
+       ++(vcpu)->stat.hypercalls;                                              \
+} while (0)
+
 int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
                              unsigned long a0, unsigned long a1,
                              unsigned long a2, unsigned long a3,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 425a301911a6..aec79e132d3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9989,12 +9989,8 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
 
 static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 {
-       u64 ret = vcpu->run->hypercall.ret;
+       kvm_complete_hypercall_exit(vcpu, rax);
 
-       if (!is_64_bit_mode(vcpu))
-               ret = (u32)ret;
-       kvm_rax_write(vcpu, ret);
-       ++vcpu->stat.hypercalls;
        return kvm_skip_emulated_instruction(vcpu);
 }
Kai Huang Nov. 5, 2024, 9:20 a.m. UTC | #17
> > 
> 
> I think I prefer Binbin's version, as it forces the caller to provide cui(), i.e.
> makes it harder for KVM to fail to handle the backend of the hypercall.

Fine to me.

[...]

> 
> The one thing I don't love about providing a separate cui() is that it means
> duplicating the guts of the completion helper.  Ha!  But we can avoid that by
> adding another macro (untested).
> 
> More macros/helpers is a bit ugly too, but I like the symmetry, and it will
> definitely be easier to maintain.  E.g. if the completion phase needs to pivot
> on the exact hypercall, then we can update common code and don't need to remember
> to go update TDX too.
> 
> If no one objects and/or has a better idea, I'll splice together Binbin's patch
> with this blob, and post a series tomorrow.
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8e8ca6dab2b2..0b0fa9174000 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2179,6 +2179,16 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>         kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>  }
>  
> +#define kvm_complete_hypercall_exit(vcpu, ret_reg)                             \
> +do {                                                                           \
> +       u64 ret = (vcpu)->run->hypercall.ret;                                   \
> +                                                                               \
> +       if (!is_64_bit_mode(vcpu))                                              \
> +               ret = (u32)ret;                                                 \
> +       kvm_##ret_reg##_write(vcpu, ret);                                       \
> +       ++(vcpu)->stat.hypercalls;                                              \
> +} while (0)
> +
>  int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>                               unsigned long a0, unsigned long a1,
>                               unsigned long a2, unsigned long a3,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 425a301911a6..aec79e132d3b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9989,12 +9989,8 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
>  
>  static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>  {
> -       u64 ret = vcpu->run->hypercall.ret;
> +       kvm_complete_hypercall_exit(vcpu, rax);
>  
> -       if (!is_64_bit_mode(vcpu))
> -               ret = (u32)ret;
> -       kvm_rax_write(vcpu, ret);
> -       ++vcpu->stat.hypercalls;
>         return kvm_skip_emulated_instruction(vcpu);
>  }
>  

I think there's one issue here:

I assume the macro kvm_complete_hypercall_exit(vcpu, ret_reg) will also be
used by TDX.  The issue is that it calls is_64_bit_mode(vcpu), which contains
the below WARN():

        WARN_ON_ONCE(vcpu->arch.guest_state_protected);

So IIUC TDX will hit this.
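
For reference, is_64_bit_mode() and is_64_bit_hypercall() in x86.h look
roughly like below (sketched from memory, so the exact bodies may differ
slightly):

static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
{
	int cs_db, cs_l;

	/* CS is not accessible when guest state is protected, e.g. TDX. */
	WARN_ON_ONCE(vcpu->arch.guest_state_protected);

	if (!is_long_mode(vcpu))
		return false;
	kvm_x86_call(get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
	return cs_l;
}

static inline bool is_64_bit_hypercall(struct kvm_vcpu *vcpu)
{
	/*
	 * A protected guest's hypercall ABI is 64-bit only, so assume
	 * 64-bit mode rather than reading the inaccessible CS.
	 */
	return vcpu->arch.guest_state_protected || is_64_bit_mode(vcpu);
}

I.e. the completion path needs to use is_64_bit_hypercall()-style logic, or
take op_64_bit from the caller, to stay safe for TDX.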

Btw, we have the below (kinda) duplicated code in ____kvm_emulate_hypercall() too:

	++vcpu->stat.hypercalls;

	if (!op_64_bit)
		ret = (u32)ret;

	kvm_register_write_raw(vcpu, ret_reg, ret);

If we add a helper to do the above, e.g.,

static void kvm_complete_hypercall_exit(struct kvm_vcpu *vcpu, int ret_reg, 
				        unsigned long ret, bool op_64_bit)
{
	if (!op_64_bit)
		ret = (u32)ret;
	kvm_register_write_raw(vcpu, ret_reg, ret);
	++vcpu->stat.hypercalls;
}

Then we can have

static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
{
	kvm_complete_hypercall_exit(vcpu, VCPU_REGS_RAX,
		vcpu->run->hypercall.ret, is_64_bit_mode(vcpu));

	return kvm_skip_emulated_instruction(vcpu);
}

TDX version can use:

	kvm_complete_hypercall_exit(vcpu, VCPU_REGS_R10,
		vcpu->run->hypercall.ret, true);

And ____kvm_emulate_hypercall() can be:

static int ____kvm_emulate_hypercall(vcpu, ...)
{
	...
out:
	kvm_complete_hypercall_exit(vcpu, ret_reg, ret, op_64_bit);
	return 1;
}
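
And on the TDX side the cui callback then becomes a thin wrapper, something
like below (untested, and tdx_complete_hypercall_exit is just a made-up name):

static int tdx_complete_hypercall_exit(struct kvm_vcpu *vcpu)
{
	kvm_complete_hypercall_exit(vcpu, VCPU_REGS_R10,
		vcpu->run->hypercall.ret, true);

	return 1;
}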
Binbin Wu Nov. 6, 2024, 8:32 a.m. UTC | #18
On 11/5/2024 5:20 PM, Huang, Kai wrote:
>> I think I prefer Binbin's version, as it forces the caller to provide cui(), i.e.
>> makes it harder for KVM to fail to handle the backend of the hypercall.
> Fine to me.
>
> [...]
>
>> The one thing I don't love about providing a separate cui() is that it means
>> duplicating the guts of the completion helper.  Ha!  But we can avoid that by
>> adding another macro (untested).
>>
>> More macros/helpers is a bit ugly too, but I like the symmetry, and it will
>> definitely be easier to maintain.  E.g. if the completion phase needs to pivot
>> on the exact hypercall, then we can update common code and don't need to remember
>> to go update TDX too.
>>
>> If no one objects and/or has a better idea, I'll splice together Binbin's patch
>> with this blob, and post a series tomorrow.
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 8e8ca6dab2b2..0b0fa9174000 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -2179,6 +2179,16 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>>          kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>>   }
>>   
>> +#define kvm_complete_hypercall_exit(vcpu, ret_reg)                             \
>> +do {                                                                           \
>> +       u64 ret = (vcpu)->run->hypercall.ret;                                   \
>> +                                                                               \
>> +       if (!is_64_bit_mode(vcpu))                                              \
>> +               ret = (u32)ret;                                                 \
>> +       kvm_##ret_reg##_write(vcpu, ret);                                       \
>> +       ++(vcpu)->stat.hypercalls;                                              \
>> +} while (0)
>> +
>>   int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>>                                unsigned long a0, unsigned long a1,
>>                                unsigned long a2, unsigned long a3,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 425a301911a6..aec79e132d3b 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -9989,12 +9989,8 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
>>   
>>   static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>>   {
>> -       u64 ret = vcpu->run->hypercall.ret;
>> +       kvm_complete_hypercall_exit(vcpu, rax);
>>   
>> -       if (!is_64_bit_mode(vcpu))
>> -               ret = (u32)ret;
>> -       kvm_rax_write(vcpu, ret);
>> -       ++vcpu->stat.hypercalls;
>>          return kvm_skip_emulated_instruction(vcpu);
>>   }
>>   
> I think there's one issue here:
>
> I assume the macro kvm_complete_hypercall_exit(vcpu, ret_reg) will also be
> used by TDX.  The issue is that it calls is_64_bit_mode(vcpu), which contains
> the below WARN():
>
>          WARN_ON_ONCE(vcpu->arch.guest_state_protected);
>
> So IIUC TDX will hit this.
>
> Btw, we have the below (kinda) duplicated code in ____kvm_emulate_hypercall() too:
>
> 	++vcpu->stat.hypercalls;
>
>          if (!op_64_bit)
>                  ret = (u32)ret;
>
>          kvm_register_write_raw(vcpu, ret_reg, ret);
>
> If we add a helper to do the above, e.g.,
>
> static void kvm_complete_hypercall_exit(struct kvm_vcpu *vcpu, int ret_reg,
> 				        unsigned long ret, bool op_64_bit)
> {
> 	if (!op_64_bit)
> 		ret = (u32)ret;
> 	kvm_register_write_raw(vcpu, ret_reg, ret);
> 	++vcpu->stat.hypercalls;
> }
If this is going to be the final version, it would be better to make it
public, and export the symbol, so that TDX code can reuse it.


>
> Then we can have
>
> static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> {
> 	kvm_complete_hypercall_exit(vcpu, VCPU_REGS_RAX,
> 		vcpu->run->hypercall.ret, is_64_bit_mode(vcpu));
>
> 	return kvm_skip_emulated_instruction(vcpu);
> }
>
> TDX version can use:
>
> 	kvm_complete_hypercall_exit(vcpu, VCPU_REGS_R10,
> 		vcpu->run->hypercall.ret, true);
>
> And ____kvm_emulate_hypercall() can be:
>
> static int ____kvm_emulate_hypercall(vcpu, ...)
> {
> 	...
> out:
> 	kvm_complete_hypercall_exit(vcpu, ret_reg, ret, op_64_bit);
> 	return 1;
> }
>
Kai Huang Nov. 6, 2024, 8:54 a.m. UTC | #19
On Wed, 2024-11-06 at 16:32 +0800, Binbin Wu wrote:
> > static void kvm_complete_hypercall_exit(struct kvm_vcpu *vcpu, int ret_reg,
> >  				        unsigned long ret, bool op_64_bit)
> > {
> >  	if (!op_64_bit)
> >  		ret = (u32)ret;
> >  	kvm_register_write_raw(vcpu, ret_reg, ret);
> >  	++vcpu->stat.hypercalls;
> > }
> If this is going to be the final version, it would be better to make it
> public, and export the symbol, so that TDX code can reuse it.

Does making it 'static inline' and moving it to kvm_host.h work?

kvm_register_write_raw(), and kvm_register_mark_dirty() (which is called by
kvm_register_write_raw()), are both 'static inline'.  Seems we can get rid of
the export by making it 'static inline'.
Binbin Wu Nov. 6, 2024, 10:11 a.m. UTC | #20
On 11/6/2024 4:54 PM, Huang, Kai wrote:
> On Wed, 2024-11-06 at 16:32 +0800, Binbin Wu wrote:
>>> static void kvm_complete_hypercall_exit(struct kvm_vcpu *vcpu, int ret_reg,
>>>   				        unsigned long ret, bool op_64_bit)
>>> {
>>>   	if (!op_64_bit)
>>>   		ret = (u32)ret;
>>>   	kvm_register_write_raw(vcpu, ret_reg, ret);
>>>   	++vcpu->stat.hypercalls;
>>> }
>> If this is going to be the final version, it would be better to make it
>> public, and export the symbol, so that TDX code can reuse it.
> Does making it 'static inline' and moving it to kvm_host.h work?
There isn't a complete definition of struct kvm_vcpu in
arch/x86/include/asm/kvm_host.h, and the code dereferences
struct kvm_vcpu.
Also, the definition of kvm_register_write_raw() is in
arch/x86/kvm/kvm_cache_regs.h, which makes it difficult to call it
from there.
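
To illustrate the ordering problem (a made-up example, not actual kernel
code): at the point arch/x86/include/asm/kvm_host.h is parsed, struct
kvm_vcpu is only a forward declaration, so a 'static inline' defined there
can't dereference it:

struct kvm_vcpu;	/* still only a forward declaration here */

static inline void helper(struct kvm_vcpu *vcpu)
{
	++vcpu->stat.hypercalls;	/* build error: incomplete type */
}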

>
> kvm_register_write_raw(), and kvm_register_mark_dirty() (which is called by
> kvm_register_write_raw()), are both 'static inline'.  Seems we can get rid of
> the export by making it 'static inline'.
Sean Christopherson Nov. 6, 2024, 3:30 p.m. UTC | #21
On Wed, Nov 06, 2024, Binbin Wu wrote:
> On 11/6/2024 4:54 PM, Huang, Kai wrote:
> > On Wed, 2024-11-06 at 16:32 +0800, Binbin Wu wrote:
> > > > static void kvm_complete_hypercall_exit(struct kvm_vcpu *vcpu, int ret_reg,
> > > >   				        unsigned long ret, bool op_64_bit)
> > > > {
> > > >   	if (!op_64_bit)
> > > >   		ret = (u32)ret;
> > > >   	kvm_register_write_raw(vcpu, ret_reg, ret);
> > > >   	++vcpu->stat.hypercalls;
> > > > }
> > > If this is going to be the final version, it would be better to make it
> > > public, and export the symbol, so that TDX code can reuse it.
> > Does making it 'static inline' and moving it to kvm_host.h work?
> There isn't a complete definition of struct kvm_vcpu in
> arch/x86/include/asm/kvm_host.h, and the code dereferences
> struct kvm_vcpu.
> Also, the definition of kvm_register_write_raw() is in
> arch/x86/kvm/kvm_cache_regs.h, which makes it difficult to call it
> from there.

A way around that would be to move the declarations from asm/kvm_host.h to x86.h,
and then kvm_complete_hypercall_exit() can be inlined (or not), without having to
deal with the kvm_host.h ordering issues.

IMO, KVM x86 would ideally put as much as possible in x86.h.  The vast majority
of KVM x86's exports are intended only for the vendor modules.  Declaring those
exports in kvm_host.h is effectively bleeding KVM internals to the broader kernel.

I'll go that route for the series, assuming it works as I intend :-)
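
E.g. something like below in x86.h (completely untested) -- x86.h already
pulls in kvm_cache_regs.h, so the helper can be a 'static inline' there and
the vendor modules get it without an export:

static inline int kvm_complete_hypercall_exit(struct kvm_vcpu *vcpu, int ret_reg,
					      unsigned long ret, bool op_64_bit)
{
	if (!op_64_bit)
		ret = (u32)ret;
	kvm_register_write_raw(vcpu, ret_reg, ret);
	++vcpu->stat.hypercalls;
	return 1;
}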
diff mbox series

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 966fb301d44b..e521f14ad2b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10220,8 +10220,9 @@  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	cpl = kvm_x86_call(get_cpl)(vcpu);
 
 	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
-	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
-		/* MAP_GPA tosses the request to the user space. */
+	/* Check !ret first to make sure nr is a valid KVM hypercall. */
+	if (!ret && user_exit_on_hypercall(vcpu->kvm, nr))
+		/* The hypercall is requested to exit to userspace. */
 		return 0;
 
 	if (!op_64_bit)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6556a43f1915..bc1a9e080acb 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -561,4 +561,8 @@  int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 			 unsigned int port, void *data,  unsigned int count,
 			 int in);
 
+static inline bool user_exit_on_hypercall(struct kvm *kvm, unsigned long hc_nr)
+{
+	return kvm->arch.hypercall_exit_enabled & BIT(hc_nr);
+}
 #endif
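
For context, hypercall_exit_enabled is populated when userspace enables
KVM_CAP_EXIT_HYPERCALL on the VM.  A minimal userspace sketch (error handling
omitted; vm_fd is assumed to be an open KVM VM file descriptor):

#include <linux/kvm.h>
#include <linux/kvm_para.h>
#include <sys/ioctl.h>

static void enable_map_gpa_range_exit(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_EXIT_HYPERCALL,
		/* args[0] is a bitmask of hypercall nrs that exit to userspace. */
		.args[0] = 1ULL << KVM_HC_MAP_GPA_RANGE,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}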