
[v19,037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific

Message ID 9bd868a287599eb2a854f6983f13b4500f47d2ae.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Series [v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro

Commit Message

Isaku Yamahata Feb. 26, 2024, 8:25 a.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX has its own limitation on the maximum number of vcpus that the guest
can accommodate.  Allow x86 kvm backend to implement its own KVM_ENABLE_CAP
handler and implement TDX backend for KVM_CAP_MAX_VCPUS.  user space VMM,
e.g. qemu, can specify its value instead of KVM_MAX_VCPUS.

When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be
specified as struct td_params_struct.  and the value is a part of
measurement.  The user space has to specify the value somehow.  There are
two options for it.
option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch)
option 2. Add max_vcpu as a parameter to initialize the guest.
          (TDG.MNG.INIT)

The flow will be create VM (KVM_CREATE_VM), create VCPUs (KVM_CREATE_VCPU),
initialize TDX part of VM and, initialize TDX part of vcpu.
Because the creation of vcpu is independent from TDX VM initialization,
Choose the option 1.
The work flow will be, KVM_CREATE_VM, set KVM_CAP_MAX_VCPU and,
KVM_CREATE_VCPU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
v18:
- use TDX instead of "x86, tdx" in subject
- use min(max_vcpu, TDX_MAX_VCPU) instead of
  min3(max_vcpu, KVM_MAX_VCPU, TDX_MAX_VCPU)
- make "if (KVM_MAX_VCPU) and if (TDX_MAX_VCPU)" into one if statement
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 ++
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/vmx/main.c            | 22 ++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.c             | 29 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h         |  5 +++++
 arch/x86/kvm/x86.c                 |  4 ++++
 6 files changed, 64 insertions(+)
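
For concreteness, a minimal userspace sketch of the flow described in the
commit message (KVM_CREATE_VM, then KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS), then
KVM_CREATE_VCPU).  This assumes the proposed patch is applied so that
KVM_CAP_MAX_VCPUS is enableable on the VM fd; the helper name is illustrative
only:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative only: create a VM and cap it at 'max_vcpus' before the
 * first KVM_CREATE_VCPU is issued, as the patch expects. */
static int create_capped_vm(int kvm_fd, unsigned long max_vcpus)
{
	struct kvm_enable_cap cap;
	int vm_fd;

	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
	if (vm_fd < 0)
		return vm_fd;

	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_MAX_VCPUS;
	cap.args[0] = max_vcpus;
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
		return -1;

	/* Only now create vCPU 0 (and the rest), after the cap is set. */
	if (ioctl(vm_fd, KVM_CREATE_VCPU, 0) < 0)
		return -1;

	return vm_fd;
}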

Comments

Huang, Kai March 21, 2024, 11:36 p.m. UTC | #1
On 26/02/2024 9:25 pm, Yamahata, Isaku wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX has its own limitation on the maximum number of vcpus that the guest
> can accommodate.  

"limitation" -> "control".

"the guest" -> "a guest".

Allow x86 kvm backend to implement its own KVM_ENABLE_CAP
> handler and implement TDX backend for KVM_CAP_MAX_VCPUS.  

I am not sure we normally say "x86 KVM backend".  Just say "Allow KVM 
x86 ...".

user space VMM,
> e.g. qemu, can specify its value instead of KVM_MAX_VCPUS.

Grammar check.

> 
> When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be
> specified as struct td_params_struct.  

'struct td_params_struct'??

Anyway, I don't think you need to mention such details.

and the value is a part of
> measurement.  The user space has to specify the value somehow.  

"and" -> "And" (grammar check please).

And add an empty line to start below as a new paragraph.

There are
> two options for it.
> option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch)
> option 2. Add max_vcpu as a parameter to initialize the guest.
>            (TDG.MNG.INIT)

First of all, it seems to me that the two are not conflicting.

Based on the uapi/kvm.h:

   #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */

Currently KVM x86 doesn't allow configuring the maximum number of vCPUs on a 
per-VM basis, but always reports KVM_MAX_VCPUS for _ALL_ VMs.  I.e., it 
doesn't allow userspace to explicitly enable KVM_CAP_MAX_VCPUS for a given VM.

Now, if we allow userspace to configure the maximum number of vCPUs for a 
TDX guest (this could be a separate discussion in fact) due to attestation 
or whatever, we need to support allowing userspace to configure MAX_VCPUS 
on a per-VM basis.

Therefore, option 1 isn't really an option to me, but is the thing that 
we _SHOULD_ do to support TDX.

So this patch should really just add "per-VM max vcpus" support for TDX, 
starting from:

	struct kvm_tdx {	/* or 'struct kvm_arch' ?? */
		...
		int max_vcpus;
	}

And in TDH.MNG.INIT, we need to manually check the MAX_VCPUS specified in 
the TD_PARAMS structure to make sure it matches the record that we 
specified via KVM_CAP_MAX_VCPUS.
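
A tiny sketch of that check (hypothetical: the helper name, and the idea that
userspace passes a max_vcpus value at TD-init time, are assumptions for
illustration only):

/*
 * Hypothetical sketch: 'init_max_vcpus' stands in for a max_vcpus value
 * passed by userspace when initializing the TD; it must match what was
 * recorded earlier via KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS).
 */
static int tdx_check_init_max_vcpus(struct kvm *kvm, u16 init_max_vcpus)
{
	if (init_max_vcpus != kvm->max_vcpus)
		return -EINVAL;

	return 0;
}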

So how about:

"
TDX has its own mechanism to control the maximum number of VCPUs that 
the TDX guest can use.  When creating a TDX guest, the maximum number of 
vcpus needs to be passed to the TDX module as part of the measurement of 
the guest.

Because the value is part of the measurement, thus part of attestation, 
it better to allow the userspace to be able to configure it.  E.g. the 
users may want to precisely control the maximum number of vcpus their 
precious VMs can use.

The actual control itself must be done via the TDH.MNG.INIT SEAMCALL 
itself, where the number of maximum cpus is an input to the TDX module, 
but KVM needs to support the "per-VM number of maximum vcpus" and 
reflect that in the KVM_CAP_MAX_VCPUS.

Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but 
doesn't allow to enable KVM_CAP_MAX_VCPUS to configure the number of 
maximum vcpus on VM-basis.

Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.

The userspace-configured value then can be verified when KVM is actually 
creating the TDX guest.
"
Isaku Yamahata March 23, 2024, 1:13 a.m. UTC | #2
On Fri, Mar 22, 2024 at 12:36:40PM +1300,
"Huang, Kai" <kai.huang@intel.com> wrote:

> So how about:

Thanks for it. I'll update the commit message with some minor fixes.

> "
> TDX has its own mechanism to control the maximum number of VCPUs that the
> TDX guest can use.  When creating a TDX guest, the maximum number of vcpus
> needs to be passed to the TDX module as part of the measurement of the
> guest.
> 
> Because the value is part of the measurement, thus part of attestation, it
                                                                           ^'s
> better to allow the userspace to be able to configure it.  E.g. the users
                  the userspace to configure it                 ^,
> may want to precisely control the maximum number of vcpus their precious VMs
> can use.
> 
> The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself,
> where the number of maximum cpus is an input to the TDX module, but KVM
> needs to support the "per-VM number of maximum vcpus" and reflect that in
                        per-VM maximum number of vcpus
> the KVM_CAP_MAX_VCPUS.
> 
> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
                                                     maximum number of vcpus
> on VM-basis.
> 
> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
> 
> The userspace-configured value then can be verified when KVM is actually
                                             used
> creating the TDX guest.
> "
Binbin Wu March 25, 2024, 8:42 a.m. UTC | #3
On 3/23/2024 9:13 AM, Isaku Yamahata wrote:
> On Fri, Mar 22, 2024 at 12:36:40PM +1300,
> "Huang, Kai" <kai.huang@intel.com> wrote:
>
>> So how about:
> Thanks for it. I'll update the commit message with some minor fixes.
>
>> "
>> TDX has its own mechanism to control the maximum number of VCPUs that the
>> TDX guest can use.  When creating a TDX guest, the maximum number of vcpus
>> needs to be passed to the TDX module as part of the measurement of the
>> guest.
>>
>> Because the value is part of the measurement, thus part of attestation, it
>                                                                             ^'s
>> better to allow the userspace to be able to configure it.  E.g. the users
>                    the userspace to configure it                 ^,
>> may want to precisely control the maximum number of vcpus their precious VMs
>> can use.
>>
>> The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself,
>> where the number of maximum cpus is an input to the TDX module, but KVM
>> needs to support the "per-VM number of maximum vcpus" and reflect that in
>                          per-VM maximum number of vcpus
>> the KVM_CAP_MAX_VCPUS.
>>
>> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
>> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
>                                                       maximum number of vcpus
>> on VM-basis.
>>
>> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
>>
>> The userspace-configured value then can be verified when KVM is actually
>                                               used

Here, "verified", I think Kai wanted to emphasize that the value of 
max_vcpus passed in via
KVM_TDX_INIT_VM should be checked against the value configured via
KVM_CAP_MAX_VCPUS?

Maybe "verified and used" ?

>> creating the TDX guest.
>> "
>
Huang, Kai March 25, 2024, 8:43 a.m. UTC | #4
>> Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
>> allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
>                                                       maximum number of vcpus
>> on VM-basis.
>>
>> Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
>>
>> The userspace-configured value then can be verified when KVM is actually
>                                               used
>> creating the TDX guest.
>> "

I think we still have two options regarding to how 'max_vcpus' is 
handled in ioctl() to do TDH.MNG.INIT:

1) Just use the 'max_vcpus' done in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS),
2) Still pass the 'max_vcpus' as input, but KVM verifies it against the 
value that is saved in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS).

2) seems unnecessary, so I don't have objection to use 1).  But it seems 
we could still mention it in the changelog in that patch?
Isaku Yamahata March 25, 2024, 9:29 p.m. UTC | #5
On Mon, Mar 25, 2024 at 09:43:31PM +1300,
"Huang, Kai" <kai.huang@intel.com> wrote:

> 
> > > Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
> > > allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
> >                                                       maximum number of vcpus
> > > on VM-basis.
> > > 
> > > Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
> > > 
> > > The userspace-configured value then can be verified when KVM is actually
> >                                               used
> > > creating the TDX guest.
> > > "
> 
> I think we still have two options regarding to how 'max_vcpus' is handled in
> ioctl() to do TDH.MNG.INIT:
> 
> 1) Just use the 'max_vcpus' done in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS),
> 2) Still pass the 'max_vcpus' as input, but KVM verifies it against the
> value that is saved in KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS).
> 
> 2) seems unnecessary, so I don't have objection to use 1).  But it seems we
> could still mention it in the changelog in that patch?

Sure, let me update the commit log.
Isaku Yamahata March 25, 2024, 9:31 p.m. UTC | #6
On Mon, Mar 25, 2024 at 04:42:36PM +0800,
Binbin Wu <binbin.wu@linux.intel.com> wrote:

> 
> 
> On 3/23/2024 9:13 AM, Isaku Yamahata wrote:
> > On Fri, Mar 22, 2024 at 12:36:40PM +1300,
> > "Huang, Kai" <kai.huang@intel.com> wrote:
> > 
> > > So how about:
> > Thanks for it. I'll update the commit message with some minor fixes.
> > 
> > > "
> > > TDX has its own mechanism to control the maximum number of VCPUs that the
> > > TDX guest can use.  When creating a TDX guest, the maximum number of vcpus
> > > needs to be passed to the TDX module as part of the measurement of the
> > > guest.
> > > 
> > > Because the value is part of the measurement, thus part of attestation, it
> >                                                                             ^'s
> > > better to allow the userspace to be able to configure it.  E.g. the users
> >                    the userspace to configure it                 ^,
> > > may want to precisely control the maximum number of vcpus their precious VMs
> > > can use.
> > > 
> > > The actual control itself must be done via the TDH.MNG.INIT SEAMCALL itself,
> > > where the number of maximum cpus is an input to the TDX module, but KVM
> > > needs to support the "per-VM number of maximum vcpus" and reflect that in
> >                          per-VM maximum number of vcpus
> > > the KVM_CAP_MAX_VCPUS.
> > > 
> > > Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but doesn't
> > > allow to enable KVM_CAP_MAX_VCPUS to configure the number of maximum vcpus
> >                                                       maximum number of vcpus
> > > on VM-basis.
> > > 
> > > Add "per-VM maximum vcpus" to KVM x86/TDX to accommodate TDX's needs.
> > > 
> > > The userspace-configured value then can be verified when KVM is actually
> >                                               used
> 
> Here, "verified", I think Kai wanted to emphasize that the value of
> max_vcpus passed in via
> KVM_TDX_INIT_VM should be checked against the value configured via
> KVM_CAP_MAX_VCPUS?
> 
> Maybe "verified and used" ?

Ok. I don't have strong opinion here.
Huang, Kai March 25, 2024, 10:47 p.m. UTC | #7
> > Here, "verified", I think Kai wanted to emphasize that the value of
> > max_vcpus passed in via KVM_TDX_INIT_VM should be checked against the
> > value configured via KVM_CAP_MAX_VCPUS?
> >
> > Maybe "verified and used" ?
> 
> Ok. I don't have strong opinion here.

It depends on how you implement that patch.  

If we don't pass 'max_vcpus' in that patch, there's nothing to verify really.
Sean Christopherson May 9, 2024, 4:35 p.m. UTC | #8
On Mon, Feb 26, 2024, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX has its own limitation on the maximum number of vcpus that the guest
> can accommodate.  Allow x86 kvm backend to implement its own KVM_ENABLE_CAP
> handler and implement TDX backend for KVM_CAP_MAX_VCPUS.  user space VMM,
> e.g. qemu, can specify its value instead of KVM_MAX_VCPUS.
> 
> When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be
> specified as struct td_params_struct.  and the value is a part of
> measurement.  The user space has to specify the value somehow.  There are
> two options for it.
> option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch)

When I suggested adding a capability[*], the intent was for the capability to
be generic, not buried in TDX code.  I can't think of any reason why this can't
be supported for all VMs on all architectures.  The only wrinkle is that it'll
require a separate capability since userspace needs to be able to detect that
KVM supports restricting the number of vCPUs, but that'll still be _less_ code.

[*] https://lore.kernel.org/all/YZVsnZ8e7cXls2P2@google.com

> +static int vt_max_vcpus(struct kvm *kvm)
> +{
> +	if (!kvm)
> +		return KVM_MAX_VCPUS;
> +
> +	if (is_td(kvm))
> +		return min(kvm->max_vcpus, TDX_MAX_VCPUS);
> +
> +	return kvm->max_vcpus;

This is _completely_ orthogonal to allowing userspace to restrict the maximum
number of vCPUs.  And unless I'm missing something, it's also ridiculous and
unnecessary at this time.  KVM x86 limits KVM_MAX_VCPUS to 4096:

  config KVM_MAX_NR_VCPUS
	int "Maximum number of vCPUs per KVM guest"
	depends on KVM
	range 1024 4096
	default 4096 if MAXSMP
	default 1024
	help

whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
a 16-bit unsigned value:

  #define TDX_MAX_VCPUS  (~(u16)0)

i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
And _if_ it becomes a problem, we don't necessarily need to have a different
_runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
being <= 64k.

So rather than add a bunch of pointless plumbing, just throw in

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 137d08da43c3..018d5b9eb93d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2488,6 +2488,9 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
                return -EOPNOTSUPP;
        }
 
+       BUILD_BUG_ON(CONFIG_KVM_MAX_NR_VCPUS <
+                    sizeof(td_params->max_vcpus) * BITS_PER_BYTE);
+
        td_params->max_vcpus = kvm->max_vcpus;
        td_params->attributes = init_vm->attributes;
        /* td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD; */
Huang, Kai May 9, 2024, 10:40 p.m. UTC | #9
On 10/05/2024 4:35 am, Sean Christopherson wrote:
> On Mon, Feb 26, 2024, isaku.yamahata@intel.com wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> TDX has its own limitation on the maximum number of vcpus that the guest
>> can accommodate.  Allow x86 kvm backend to implement its own KVM_ENABLE_CAP
>> handler and implement TDX backend for KVM_CAP_MAX_VCPUS.  user space VMM,
>> e.g. qemu, can specify its value instead of KVM_MAX_VCPUS.
>>
>> When creating TD (TDH.MNG.INIT), the maximum number of vcpu needs to be
>> specified as struct td_params_struct.  and the value is a part of
>> measurement.  The user space has to specify the value somehow.  There are
>> two options for it.
>> option 1. API (Set KVM_CAP_MAX_VCPU) to specify the value (this patch)
> 
> When I suggested adding a capability[*], the intent was for the capability to
> be generic, not buried in TDX code.  I can't think of any reason why this can't
> be supported for all VMs on all architectures.  The only wrinkle is that it'll
> require a separate capability since userspace needs to be able to detect that
> KVM supports restricting the number of vCPUs, but that'll still be _less_ code.
> 
> [*] https://lore.kernel.org/all/YZVsnZ8e7cXls2P2@google.com
> 
>> +static int vt_max_vcpus(struct kvm *kvm)
>> +{
>> +	if (!kvm)
>> +		return KVM_MAX_VCPUS;
>> +
>> +	if (is_td(kvm))
>> +		return min(kvm->max_vcpus, TDX_MAX_VCPUS);
>> +
>> +	return kvm->max_vcpus;
> 
> This is _completely_ orthogonal to allowing userspace to restrict the maximum
> number of vCPUs.  And unless I'm missing something, it's also ridiculous and
> unnecessary at this time.  

Right it's not necessary.  I think it can be reported as:

         case KVM_CAP_MAX_VCPUS:
                 r = KVM_MAX_VCPUS;
+               if (kvm)
+                       r = kvm->max_vcpus;
                 break;


>KVM x86 limits KVM_MAX_VCPUS to 4096:
> 
>    config KVM_MAX_NR_VCPUS
> 	int "Maximum number of vCPUs per KVM guest"
> 	depends on KVM
> 	range 1024 4096
> 	default 4096 if MAXSMP
> 	default 1024
> 	help
> 
> whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
> a 16-bit unsigned value:
> 
>    #define TDX_MAX_VCPUS  (~(u16)0)
> 
> i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
> And _if_ it becomes a problem, we don't necessarily need to have a different
> _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
> being <= 64k.

Actually later versions of TDX module (starting from 1.5 AFAICT), the 
module has a metadata field to report the maximum vCPUs that the module 
can support for all TDX guests.

> 
> So rather than add a bunch of pointless plumbing, just throw in
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 137d08da43c3..018d5b9eb93d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2488,6 +2488,9 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
>                  return -EOPNOTSUPP;
>          }
>   
> +       BUILD_BUG_ON(CONFIG_KVM_MAX_NR_VCPUS <
> +                    sizeof(td_params->max_vcpus) * BITS_PER_BYTE);
> +
>          td_params->max_vcpus = kvm->max_vcpus;
>          td_params->attributes = init_vm->attributes;
>          /* td_params->exec_controls = TDX_CONTROL_FLAG_NO_RBP_MOD; */
> 

Yeah the above could be helpful, but might not be necessary.

So the logic of the updated patch is:

1) During module loading time, we grab the maximum vCPUs that the TDX 
module can support:


	/*
	 * TDX module may not support MD_FIELD_ID_MAX_VCPUS_PER_TD
	 * depending on its version.
	 */
	tdx_info->max_vcpus_per_td = U16_MAX;
	if (!tdx_sys_metadata_field_read(MD_FIELD_ID_MAX_VCPUS_PER_TD,
					&tmp))
	        tdx_info->max_vcpus_per_td = (u16)tmp;

2) When a TDX guest is created, userspace needs to call 
IOCTL(KVM_ENABLE_CAP) to configure the maximum vCPUs of the guest.  A 
new kvm_x86_ops::vm_enable_cap() is added because TDX has its own 
limitation (the metadata field) as mentioned above.

@@ -6827,6 +6829,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
         }
         default:
                 r = -EINVAL;
+               if (kvm_x86_ops.vm_enable_cap)
+                       r = static_call(kvm_x86_vm_enable_cap)(kvm,
+				 cap);

And we only allow the kvm->max_vcpus to be updated if it's a TDX guest 
in the vt_vm_enable_cap().  The reason is we want to avoid unnecessary 
change for normal VMX guests.

And the new kvm->max_vcpus cannot exceed the KVM_MAX_VCPUS and the 
tdx_info->max_vcpus_per_td:


+       case KVM_CAP_MAX_VCPUS: {
+               if (cap->flags || cap->args[0] == 0)
+                       return -EINVAL;
+               if (cap->args[0] > KVM_MAX_VCPUS ||
+                   cap->args[0] > tdx_info->max_vcpus_per_td)
+                       return -E2BIG;
+
+               mutex_lock(&kvm->lock);
+               if (kvm->created_vcpus)
+                       r = -EBUSY;
+               else {
+                       kvm->max_vcpus = cap->args[0];
+                       r = 0;
+               }
+               mutex_unlock(&kvm->lock);
+               break;
+       }

3) We just report kvm->max_vcpus when the userspace wants to check the 
KVM_CAP_MAX_VCPUS as shown in the beginning of my reply.

Does this make sense to you?

I am also pasting the new updated patch for your review (there are line 
wrapper issues unfortunately due to the simple copy/paste):

 From 797e439634d106f744517c97c5ea7887e494fc44 Mon Sep 17 00:00:00 2001
From: Isaku Yamahata <isaku.yamahata@intel.com>
Date: Thu, 16 Feb 2023 17:03:40 -0800
Subject: [PATCH] KVM: TDX: Allow userspace to configure maximum vCPUs 
for TDX guests

TDX has its own mechanism to control the maximum number of vCPUs that
the TDX guest can use.  When creating a TDX guest, the maximum number of
vCPUs of the guest needs to be passed to the TDX module as part of the
measurement of the guest.  Depending on TDX module's version, it may
also report the maximum vCPUs it can support for all TDX guests.

Because the maximum number of vCPUs is part of the measurement, thus
part of attestation, it's better to allow the userspace to be able to
configure it.  E.g. the users may want to precisely control the maximum
number of vCPUs their precious VMs can use.

The actual control itself must be done via the TDH.MNG.INIT SEAMCALL,
where the number of maximum cpus is part of the input to the TDX module,
but KVM needs to support the "per-VM maximum number of vCPUs" and
reflect that in the KVM_CAP_MAX_VCPUS.

Currently, the KVM x86 always reports KVM_MAX_VCPUS for all VMs but
doesn't allow to enable KVM_CAP_MAX_VCPUS to configure the number of
maximum vCPUs on VM-basis.

Add "per-VM maximum number of vCPUs" to KVM x86/TDX to accommodate TDX's
needs.

Specifically, use KVM's existing KVM_ENABLE_CAP IOCTL() to allow the
userspace to configure the maximum vCPUs by making KVM x86 support
enabling the KVM_CAP_MAX_VCPUS cap on VM-basis.

For that, add a new 'kvm_x86_ops::vm_enable_cap()' callback and call
it from kvm_vm_ioctl_enable_cap() as a placeholder to handle the
KVM_CAP_MAX_VCPUS for TDX guests (and other KVM_CAP_xx for TDX and/or
other VMs if needed in the future).

Implement the callback for TDX guest to check whether the maximum vCPUs
passed from usrspace can be supported by TDX, and if it can, override
Accordingly, in the KVM_CHECK_EXTENSION IOCTL(), change to return the
'struct kvm::max_vcpus' for a given VM for the KVM_CAP_MAX_VCPUS.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
v20:
- Drop max_vcpu ops to use kvm.max_vcpus
- Remove TDX_MAX_VCPUS (Kai)
- Use type cast (u16) instead of calling memcpy() when reading the
    'max_vcpus_per_td' (Kai)
- Improve change log and change patch title from "KVM: TDX: Make
   KVM_CAP_MAX_VCPUS backend specific" (Kai)
---
  arch/x86/include/asm/kvm-x86-ops.h |  1 +
  arch/x86/include/asm/kvm_host.h    |  1 +
  arch/x86/kvm/vmx/main.c            | 10 ++++++++
  arch/x86/kvm/vmx/tdx.c             | 40 ++++++++++++++++++++++++++++++
  arch/x86/kvm/vmx/x86_ops.h         |  5 ++++
  arch/x86/kvm/x86.c                 |  4 +++
  6 files changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index bcb8302561f2..022b9eace3a5 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable)
  KVM_X86_OP(hardware_unsetup)
  KVM_X86_OP(has_emulated_msr)
  KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP_OPTIONAL(vm_enable_cap)
  KVM_X86_OP(vm_init)
  KVM_X86_OP_OPTIONAL(vm_destroy)
  KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c461c2e57fcb..1d10e3d29533 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1639,6 +1639,7 @@ struct kvm_x86_ops {
         void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

         unsigned int vm_size;
+       int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
         int (*vm_init)(struct kvm *kvm);
         void (*vm_destroy)(struct kvm *kvm);

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8e4aa8d15aec..686ca6348993 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,7 @@
  #include "nested.h"
  #include "pmu.h"
  #include "tdx.h"
+#include "tdx_arch.h"

  static bool enable_tdx __ro_after_init;
  module_param_named(tdx, enable_tdx, bool, 0444);
@@ -33,6 +34,14 @@ static void vt_hardware_unsetup(void)
         vmx_hardware_unsetup();
  }

+static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+       if (is_td(kvm))
+               return tdx_vm_enable_cap(kvm, cap);
+
+       return -EINVAL;
+}
+

  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
  {
         if (!is_td(kvm))
@@ -63,6 +72,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
         .has_emulated_msr = vmx_has_emulated_msr,

         .vm_size = sizeof(struct kvm_vmx),
+       .vm_enable_cap = vt_vm_enable_cap,
         .vm_init = vmx_vm_init,
         .vm_destroy = vmx_vm_destroy,

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c7d849582d44..cdfc95904d6c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -34,6 +34,8 @@ struct tdx_info {
         u64 xfam_fixed0;
         u64 xfam_fixed1;

+       u16 max_vcpus_per_td;
+
         u16 num_cpuid_config;
         /* This must the last member. */
         DECLARE_FLEX_ARRAY(struct kvm_tdx_cpuid_config, cpuid_configs);
@@ -42,6 +44,35 @@ struct tdx_info {
  /* Info about the TDX module. */
  static struct tdx_info *tdx_info;

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+       int r;
+
+       switch (cap->cap) {
+       case KVM_CAP_MAX_VCPUS: {
+               if (cap->flags || cap->args[0] == 0)
+                       return -EINVAL;
+               if (cap->args[0] > KVM_MAX_VCPUS ||
+                   cap->args[0] > tdx_info->max_vcpus_per_td)
+                       return -E2BIG;
+
+               mutex_lock(&kvm->lock);
+               if (kvm->created_vcpus)
+                       r = -EBUSY;
+               else {
+                       kvm->max_vcpus = cap->args[0];
+                       r = 0;
+               }
+               mutex_unlock(&kvm->lock);
+               break;
+       }
+       default:
+               r = -EINVAL;
+               break;
+       }
+       return r;
+}
+
  static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
  {
         struct kvm_tdx_capabilities __user *user_caps;
@@ -129,6 +160,7 @@ static int __init tdx_module_setup(void)
                 u16 num_cpuid_config;
                 /* More member will come. */
         } st;
+       u64 tmp;
         int ret;
         u32 i;

@@ -167,6 +199,14 @@ static int __init tdx_module_setup(void)
                 return -ENOMEM;
         tdx_info->num_cpuid_config = st.num_cpuid_config;

+       /*
+        * TDX module may not support MD_FIELD_ID_MAX_VCPUS_PER_TD depending
+        * on its version.
+        */
+       tdx_info->max_vcpus_per_td = U16_MAX;
+       if (!tdx_sys_metadata_field_read(MD_FIELD_ID_MAX_VCPUS_PER_TD, &tmp))
+               tdx_info->max_vcpus_per_td = (u16)tmp;
+
         ret = tdx_sys_metadata_read(fields, ARRAY_SIZE(fields), tdx_info);
         if (ret)
                 goto error_out;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 9bc287a7efac..7c768e360bc6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -139,11 +139,16 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
  int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
  void tdx_hardware_unsetup(void);

+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
  #else
  static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
  static inline void tdx_hardware_unsetup(void) {}

+static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+       return -EINVAL;
+};
  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
  #endif

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee8288a46d30..97ed4fe25964 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4776,6 +4776,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
                 break;
         case KVM_CAP_MAX_VCPUS:
                 r = KVM_MAX_VCPUS;
+               if (kvm)
+                       r = kvm->max_vcpus;
                 break;
         case KVM_CAP_MAX_VCPU_ID:
                 r = KVM_MAX_VCPU_IDS;
@@ -6827,6 +6829,8 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
         }
         default:
                 r = -EINVAL;
+               if (kvm_x86_ops.vm_enable_cap)
+                       r = static_call(kvm_x86_vm_enable_cap)(kvm, cap);
                 break;
         }
         return r;
--
2.34.1
Huang, Kai May 9, 2024, 10:47 p.m. UTC | #10
> 
> Implement the callback for TDX guest to check whether the maximum vCPUs
> passed from usrspace can be supported by TDX, and if it can, override
> Accordingly, in the KVM_CHECK_EXTENSION IOCTL(), change to return the
> 'struct kvm::max_vcpus' for a given VM for the KVM_CAP_MAX_VCPUS.

Sorry some error during copy/paste.  The above should be:

Implement the callback for TDX guest to check whether the maximum vCPUs
passed from usrspace can be supported by TDX, and if it can, override
the 'struct kvm::max_vcpus'.  Leave VMX guests and all AMD guests
unsupported to avoid any side-effect for those VMs.

Accordingly, in the KVM_CHECK_EXTENSION IOCTL(), change to return the
'struct kvm::max_vcpus' for a given VM for the KVM_CAP_MAX_VCPUS.
Sean Christopherson May 9, 2024, 10:52 p.m. UTC | #11
On Fri, May 10, 2024, Kai Huang wrote:
> On 10/05/2024 4:35 am, Sean Christopherson wrote:
> > KVM x86 limits KVM_MAX_VCPUS to 4096:
> > 
> >    config KVM_MAX_NR_VCPUS
> > 	int "Maximum number of vCPUs per KVM guest"
> > 	depends on KVM
> > 	range 1024 4096
> > 	default 4096 if MAXSMP
> > 	default 1024
> > 	help
> > 
> > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
> > a 16-bit unsigned value:
> > 
> >    #define TDX_MAX_VCPUS  (~(u16)0)
> > 
> > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
> > And _if_ it becomes a problem, we don't necessarily need to have a different
> > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
> > being <= 64k.
> 
> Actually later versions of TDX module (starting from 1.5 AFAICT), the module
> has a metadata field to report the maximum vCPUs that the module can support
> for all TDX guests.

My quick glance at the 1.5 source shows that the limit is still effectively
0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
max at runtime and simply refuse to use a TDX module that has dropped the minimum
below 0xffff.

> And we only allow the kvm->max_vcpus to be updated if it's a TDX guest in
> the vt_vm_enable_cap().  The reason is we want to avoid unnecessary change
> for normal VMX guests.

That's a frankly ridiculous reason to bury code in TDX.  Nothing is _forcing_
userspace to set KVM_CAP_MAX_VCPUS, i.e. there won't be any change to VMX VMs
unless userspace _wants_ there to be a change.
Huang, Kai May 9, 2024, 11:19 p.m. UTC | #12
On 10/05/2024 10:52 am, Sean Christopherson wrote:
> On Fri, May 10, 2024, Kai Huang wrote:
>> On 10/05/2024 4:35 am, Sean Christopherson wrote:
>>> KVM x86 limits KVM_MAX_VCPUS to 4096:
>>>
>>>     config KVM_MAX_NR_VCPUS
>>> 	int "Maximum number of vCPUs per KVM guest"
>>> 	depends on KVM
>>> 	range 1024 4096
>>> 	default 4096 if MAXSMP
>>> 	default 1024
>>> 	help
>>>
>>> whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
>>> a 16-bit unsigned value:
>>>
>>>     #define TDX_MAX_VCPUS  (~(u16)0)
>>>
>>> i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
>>> And _if_ it becomes a problem, we don't necessarily need to have a different
>>> _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
>>> being <= 64k.
>>
>> Actually later versions of TDX module (starting from 1.5 AFAICT), the module
>> has a metadata field to report the maximum vCPUs that the module can support
>> for all TDX guests.
> 
> My quick glance at the 1.5 source shows that the limit is still effectively
> 0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
> max at runtime and simply refuse to use a TDX module that has dropped the minimum
> below 0xffff.

I need to double check why this metadata field was added.  My concern is 
in future module versions they may just low down the value.

But another way to handle is we can adjust code when that truly happens? 
Might not be ideal for stable kernel situation, though?

> 
>> And we only allow the kvm->max_vcpus to be updated if it's a TDX guest in
>> the vt_vm_enable_cap().  The reason is we want to avoid unnecessary change
>> for normal VMX guests.
> 
> That's a frankly ridiculous reason to bury code in TDX.  Nothing is _forcing_
> userspace to set KVM_CAP_MAX_VCPUS, i.e. there won't be any change to VMX VMs
> unless userspace _wants_ there to be a change.

Right.  Anyway allowing userspace to set KVM_CAP_MAX_VCPUS for non-TDX 
guests shouldn't have any issue.

The main reason to bury code in TDX is it needs to additionally check 
tdx_info->max_vcpus_per_td.  We can just do in common code if we avoid 
that TDX specific check.
Isaku Yamahata May 9, 2024, 11:55 p.m. UTC | #13
On Fri, May 10, 2024 at 11:19:44AM +1200,
"Huang, Kai" <kai.huang@intel.com> wrote:

> 
> 
> On 10/05/2024 10:52 am, Sean Christopherson wrote:
> > On Fri, May 10, 2024, Kai Huang wrote:
> > > On 10/05/2024 4:35 am, Sean Christopherson wrote:
> > > > KVM x86 limits KVM_MAX_VCPUS to 4096:
> > > > 
> > > >     config KVM_MAX_NR_VCPUS
> > > > 	int "Maximum number of vCPUs per KVM guest"
> > > > 	depends on KVM
> > > > 	range 1024 4096
> > > > 	default 4096 if MAXSMP
> > > > 	default 1024
> > > > 	help
> > > > 
> > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
> > > > a 16-bit unsigned value:
> > > > 
> > > >     #define TDX_MAX_VCPUS  (~(u16)0)
> > > > 
> > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
> > > > And _if_ it becomes a problem, we don't necessarily need to have a different
> > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
> > > > being <= 64k.
> > > 
> > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module
> > > has a metadata field to report the maximum vCPUs that the module can support
> > > for all TDX guests.
> > 
> > My quick glance at the 1.5 source shows that the limit is still effectively
> > 0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
> > max at runtime and simply refuse to use a TDX module that has dropped the minimum
> > below 0xffff.
> 
> I need to double check why this metadata field was added.  My concern is in
> future module versions they may just low down the value.

TD partitioning would reduce it much.


> But another way to handle is we can adjust code when that truly happens?
> Might not be ideal for stable kernel situation, though?
>
> > > And we only allow the kvm->max_vcpus to be updated if it's a TDX guest in
> > > the vt_vm_enable_cap().  The reason is we want to avoid unnecessary change
> > > for normal VMX guests.
> > 
> > That's a frankly ridiculous reason to bury code in TDX.  Nothing is _forcing_
> > userspace to set KVM_CAP_MAX_VCPUS, i.e. there won't be any change to VMX VMs
> > unless userspace _wants_ there to be a change.
> 
> Right.  Anyway allowing userspace to set KVM_CAP_MAX_VCPUS for non-TDX
> guests shouldn't have any issue.
> 
> The main reason to bury code in TDX is it needs to additionally check
> tdx_info->max_vcpus_per_td.  We can just do in common code if we avoid that
> TDX specific check.

So we can make it arch-independent.

When creating VM, we can set kvm->max_vcpu = tdx_info->max_vcpus_per_td by
tdx_vm_init().  The check can be common like
"if (new max_vcpu > kvm->max_vcpu) error".

Or we can add kvm->hard_max_vcpu or something,  arch-common code can have
"if (kvm->hard_max_vcpu && new max_vcpu > kvm->hard_max_vcpu) error".
Sean Christopherson May 10, 2024, 2:04 p.m. UTC | #14
On Thu, May 09, 2024, Isaku Yamahata wrote:
> On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote:
> > On 10/05/2024 10:52 am, Sean Christopherson wrote:
> > > On Fri, May 10, 2024, Kai Huang wrote:
> > > > On 10/05/2024 4:35 am, Sean Christopherson wrote:
> > > > > KVM x86 limits KVM_MAX_VCPUS to 4096:
> > > > > 
> > > > >     config KVM_MAX_NR_VCPUS
> > > > > 	int "Maximum number of vCPUs per KVM guest"
> > > > > 	depends on KVM
> > > > > 	range 1024 4096
> > > > > 	default 4096 if MAXSMP
> > > > > 	default 1024
> > > > > 	help
> > > > > 
> > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
> > > > > a 16-bit unsigned value:
> > > > > 
> > > > >     #define TDX_MAX_VCPUS  (~(u16)0)
> > > > > 
> > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
> > > > > And _if_ it becomes a problem, we don't necessarily need to have a different
> > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
> > > > > being <= 64k.
> > > > 
> > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module
> > > > has a metadata field to report the maximum vCPUs that the module can support
> > > > for all TDX guests.
> > > 
> > > My quick glance at the 1.5 source shows that the limit is still effectively
> > > 0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
> > > max at runtime and simply refuse to use a TDX module that has dropped the minimum
> > > below 0xffff.
> > 
> > I need to double check why this metadata field was added.  My concern is in
> > future module versions they may just low down the value.
> 
> TD partitioning would reduce it much.

That's still not a reason to plumb in what is effectively dead code.  Either
partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express
the limitations to userspace, or the TDX-module is potentially breaking existing
use cases.
Huang, Kai May 14, 2024, 2:01 a.m. UTC | #15
On 11/05/2024 2:04 am, Sean Christopherson wrote:
> On Thu, May 09, 2024, Isaku Yamahata wrote:
>> On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote:
>>> On 10/05/2024 10:52 am, Sean Christopherson wrote:
>>>> On Fri, May 10, 2024, Kai Huang wrote:
>>>>> On 10/05/2024 4:35 am, Sean Christopherson wrote:
>>>>>> KVM x86 limits KVM_MAX_VCPUS to 4096:
>>>>>>
>>>>>>      config KVM_MAX_NR_VCPUS
>>>>>> 	int "Maximum number of vCPUs per KVM guest"
>>>>>> 	depends on KVM
>>>>>> 	range 1024 4096
>>>>>> 	default 4096 if MAXSMP
>>>>>> 	default 1024
>>>>>> 	help
>>>>>>
>>>>>> whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
>>>>>> a 16-bit unsigned value:
>>>>>>
>>>>>>      #define TDX_MAX_VCPUS  (~(u16)0)
>>>>>>
>>>>>> i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
>>>>>> And _if_ it becomes a problem, we don't necessarily need to have a different
>>>>>> _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
>>>>>> being <= 64k.
>>>>>
>>>>> Actually later versions of TDX module (starting from 1.5 AFAICT), the module
>>>>> has a metadata field to report the maximum vCPUs that the module can support
>>>>> for all TDX guests.
>>>>
>>>> My quick glance at the 1.5 source shows that the limit is still effectively
>>>> 0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
>>>> max at runtime and simply refuse to use a TDX module that has dropped the minimum
>>>> below 0xffff.
>>>
>>> I need to double check why this metadata field was added.  My concern is in
>>> future module versions they may just low down the value.
>>
>> TD partitioning would reduce it much.
> 
> That's still not a reason to plumb in what is effectively dead code.  Either
> partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express
> the limitations to userspace, or the TDX-module is potentially breaking existing
> use cases.

The 'max_vcpus_per_td' global metadata fields is static for the TDX 
module.  If the module supports the TD partitioning, it just reports 
some smaller value regardless whether we opt-in TDX partitioning or not.

I think the point is this 'max_vcpus_per_td' is TDX architectural thing 
and kernel should not make any assumption of the value of it.  The 
architectural behaviour is:

   If the module has this 'max_vcpus_per_td', software should read and
   use it; Otherwise software should treat it as U16_MAX.

Thus I don't think we will need a new uAPI (TDX specific presumably) 
just for TD partitioning.  And this doesn't break existing use cases.

In fact, this doesn't prevent us from making the KVM_CAP_MAX_VCPUS code 
generic, e.g., we can do below:

1) In tdx_vm_init() (called via KVM_VM_CREATE -> vt_vm_init()), we do:

	kvm->max_vcpus = min(kvm->max_vcpus,
				tdx_info->max_vcpus_per_td);

2) In kvm_vm_ioctl_enable_cap_generic(), we add support to handle 
KVM_CAP_MAX_VCPUS to have the generic code to do:

	if (new_max_vcpus > kvm->max_vcpus)
		return -EINVAL;

	kvm->max_vcpus = new_max_vcpus;

However this means we only allow "lowering" the kvm->max_vcpus in 
kvm_vm_ioctl_enable_cap_generic(KVM_CAP_MAX_VCPUS), but I think this is 
acceptable?
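
For concreteness, a sketch of what that generic handling could look like (the
helper name is made up here, and the created_vcpus/locking handling simply
mirrors the TDX-specific handler quoted earlier in the thread):

/*
 * Sketch only: generic handling of KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS).
 * Only lowering below the current kvm->max_vcpus is allowed, and only
 * before any vCPU has been created.
 */
static int kvm_enable_cap_max_vcpus(struct kvm *kvm, struct kvm_enable_cap *cap)
{
	int r = 0;

	if (cap->flags || cap->args[0] == 0 || cap->args[0] > kvm->max_vcpus)
		return -EINVAL;

	mutex_lock(&kvm->lock);
	if (kvm->created_vcpus)
		r = -EBUSY;
	else
		kvm->max_vcpus = cap->args[0];
	mutex_unlock(&kvm->lock);

	return r;
}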

If it is a concern, alternatively, we can add a new 
'kvm->hard_max_vcpus' (or whatever makes sense), and set it in 
kvm_create_vm() right after kvm_arch_init_vm():

	r = kvm_arch_init_vm(kvm, type);
         if (r)
                 goto out_err_no_arch_destroy_vm;

	kvm->hard_max_vcpus = kvm->max_vcpus;

So it always contains "the max_vcpus limited by the ARCH 
hardware/firmware etc".

And in kvm_vm_ioctl_enable_cap_generic(), we check against 
kvm->hard_max_vcpus instead to get rid of the limitation of only 
allowing kvm->max_vcpus to be lowered.

But I don't think this is necessary at this stage.
Sean Christopherson May 29, 2024, 11:15 p.m. UTC | #16
On Tue, May 14, 2024, Kai Huang wrote:
> 
> 
> On 11/05/2024 2:04 am, Sean Christopherson wrote:
> > On Thu, May 09, 2024, Isaku Yamahata wrote:
> > > On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote:
> > > > On 10/05/2024 10:52 am, Sean Christopherson wrote:
> > > > > On Fri, May 10, 2024, Kai Huang wrote:
> > > > > > On 10/05/2024 4:35 am, Sean Christopherson wrote:
> > > > > > > KVM x86 limits KVM_MAX_VCPUS to 4096:
> > > > > > > 
> > > > > > >      config KVM_MAX_NR_VCPUS
> > > > > > > 	int "Maximum number of vCPUs per KVM guest"
> > > > > > > 	depends on KVM
> > > > > > > 	range 1024 4096
> > > > > > > 	default 4096 if MAXSMP
> > > > > > > 	default 1024
> > > > > > > 	help
> > > > > > > 
> > > > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
> > > > > > > a 16-bit unsigned value:
> > > > > > > 
> > > > > > >      #define TDX_MAX_VCPUS  (~(u16)0)
> > > > > > > 
> > > > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
> > > > > > > And _if_ it becomes a problem, we don't necessarily need to have a different
> > > > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
> > > > > > > being <= 64k.
> > > > > > 
> > > > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module
> > > > > > has a metadata field to report the maximum vCPUs that the module can support
> > > > > > for all TDX guests.
> > > > > 
> > > > > My quick glance at the 1.5 source shows that the limit is still effectively
> > > > > 0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
> > > > > max at runtime and simply refuse to use a TDX module that has dropped the minimum
> > > > > below 0xffff.
> > > > 
> > > > I need to double check why this metadata field was added.  My concern is in
> > > > future module versions they may just low down the value.
> > > 
> > > TD partitioning would reduce it much.
> > 
> > That's still not a reason to plumb in what is effectively dead code.  Either
> > partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express
> > the limitations to userspace, or the TDX-module is potentially breaking existing
> > use cases.
> 
> The 'max_vcpus_per_td' global metadata fields is static for the TDX module.
> If the module supports the TD partitioning, it just reports some smaller
> value regardless whether we opt-in TDX partitioning or not.
> 
> I think the point is this 'max_vcpus_per_td' is TDX architectural thing and
> kernel should not make any assumption of the value of it.

It's not an assumption, it's a requirement.  And KVM already places requirements
on "hardware", e.g. kvm-intel.ko will refuse to load if the CPU doesn't support
a bare mimimum VMX feature set.  Refusing to enable TDX because max_vcpus_per_td
is unexpectedly low isn't fundamentally different than refusing to enable VMX
because IRQ window exiting is unsupported.

In the unlikely event there is a legitimate reason for max_vcpus_per_td being
less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
that's purely theoretical at this point, i.e. this is all much ado about nothing.
Huang, Kai May 30, 2024, 12:21 p.m. UTC | #17
On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote:
> On Tue, May 14, 2024, Kai Huang wrote:
> > 
> > 
> > On 11/05/2024 2:04 am, Sean Christopherson wrote:
> > > On Thu, May 09, 2024, Isaku Yamahata wrote:
> > > > On Fri, May 10, 2024 at 11:19:44AM +1200, Kai Huang <kai.huang@intel.com> wrote:
> > > > > On 10/05/2024 10:52 am, Sean Christopherson wrote:
> > > > > > On Fri, May 10, 2024, Kai Huang wrote:
> > > > > > > On 10/05/2024 4:35 am, Sean Christopherson wrote:
> > > > > > > > KVM x86 limits KVM_MAX_VCPUS to 4096:
> > > > > > > > 
> > > > > > > >      config KVM_MAX_NR_VCPUS
> > > > > > > > 	int "Maximum number of vCPUs per KVM guest"
> > > > > > > > 	depends on KVM
> > > > > > > > 	range 1024 4096
> > > > > > > > 	default 4096 if MAXSMP
> > > > > > > > 	default 1024
> > > > > > > > 	help
> > > > > > > > 
> > > > > > > > whereas the limitation from TDX is apprarently simply due to TD_PARAMS taking
> > > > > > > > a 16-bit unsigned value:
> > > > > > > > 
> > > > > > > >      #define TDX_MAX_VCPUS  (~(u16)0)
> > > > > > > > 
> > > > > > > > i.e. it will likely be _years_ before TDX's limitation matters, if it ever does.
> > > > > > > > And _if_ it becomes a problem, we don't necessarily need to have a different
> > > > > > > > _runtime_ limit for TDX, e.g. TDX support could be conditioned on KVM_MAX_NR_VCPUS
> > > > > > > > being <= 64k.
> > > > > > > 
> > > > > > > Actually later versions of TDX module (starting from 1.5 AFAICT), the module
> > > > > > > has a metadata field to report the maximum vCPUs that the module can support
> > > > > > > for all TDX guests.
> > > > > > 
> > > > > > My quick glance at the 1.5 source shows that the limit is still effectively
> > > > > > 0xffff, so again, who cares?  Assert on 0xffff compile time, and on the reported
> > > > > > max at runtime and simply refuse to use a TDX module that has dropped the minimum
> > > > > > below 0xffff.
> > > > > 
> > > > > I need to double check why this metadata field was added.  My concern is in
> > > > > future module versions they may just low down the value.
> > > > 
> > > > TD partitioning would reduce it much.
> > > 
> > > That's still not a reason to plumb in what is effectively dead code.  Either
> > > partitioning is opt-in, at which I suspect KVM will need yet more uAPI to express
> > > the limitations to userspace, or the TDX-module is potentially breaking existing
> > > use cases.
> > 
> > The 'max_vcpus_per_td' global metadata fields is static for the TDX module.
> > If the module supports the TD partitioning, it just reports some smaller
> > value regardless whether we opt-in TDX partitioning or not.
> > 
> > I think the point is this 'max_vcpus_per_td' is TDX architectural thing and
> > kernel should not make any assumption of the value of it.
> 
> It's not an assumption, it's a requirement.  And KVM already places requirements
> on "hardware", e.g. kvm-intel.ko will refuse to load if the CPU doesn't support
> a bare mimimum VMX feature set.  Refusing to enable TDX because max_vcpus_per_td
> is unexpectedly low isn't fundamentally different than refusing to enable VMX
> because IRQ window exiting is unsupported.

OK.  I have no argument against this.

But I am not sure why we need to have such requirement.  See below.

> 
> In the unlikely event there is a legitimate reason for max_vcpus_per_td being
> less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
> that's purely theoretical at this point, i.e. this is all much ado about nothing.

I am afraid we already have a legitimate case: TD partitioning.  Isaku
told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD
partitioning supported.  And again this is static, i.e., doesn't require
TD partitioning to be opt-in to low to 512.

So AFAICT this isn't a theoretical thing now.

Also, I want to say I was wrong about "MAX_VCPUS" in the TD_PARAMS is part
of attestation.  It is not.  TDREPORT dosen't include the "MAX_VCPUS", and
it is not involved in the calculation of the measurement of the guest.

Given "MAX_VCPUS" is not part of attestation, I think there's no need to
allow user to change kvm->max_vcpus by enabling KVM_ENABLE_CAP ioctl() for
KVM_CAP_MAX_VCPUS.

So we could just once for all adjust kvm->max_vcpus for TDX in the
tdx_vm_init() for TDX guest:

	kvm->max_vcpus = min(kvm->max_vcpus, tdx_info->max_vcpus_per_td);

AFAICT no other change is needed.

And in KVM_TDX_VM_INIT (where TDH.MNG.INIT is done) we can just use kvm-
>max_vcpus to fill the "MAX_VCPUS" in TD_PARAMS.
Sean Christopherson May 30, 2024, 11:12 p.m. UTC | #18
On Thu, May 30, 2024, Kai Huang wrote:
> On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote:
> > In the unlikely event there is a legitimate reason for max_vcpus_per_td being
> > less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
> > that's purely theoretical at this point, i.e. this is all much ado about nothing.
> 
> I am afraid we already have a legitimate case: TD partitioning.  Isaku
> told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD
> partitioning supported.  And again this is static, i.e., doesn't require
> TD partitioning to be opt-in to low to 512.

So what's Intel's plan for use cases that creates TDs with >512 vCPUs?

> So AFAICT this isn't a theoretical thing now.
> 
> Also, I want to say I was wrong about "MAX_VCPUS" in the TD_PARAMS is part
> of attestation.  It is not.  TDREPORT dosen't include the "MAX_VCPUS", and
> it is not involved in the calculation of the measurement of the guest.
> 
> Given "MAX_VCPUS" is not part of attestation, I think there's no need to
> allow user to change kvm->max_vcpus by enabling KVM_ENABLE_CAP ioctl() for
> KVM_CAP_MAX_VCPUS.

Sure, but KVM would still need to advertise the reduced value for KVM_CAP_MAX_VCPUS
when queried via KVM_CHECK_EXTENSION.  And userspace needs to be conditioned to
do a VM-scoped check, not a system-scoped check.

> So we could just once for all adjust kvm->max_vcpus for TDX in the
> tdx_vm_init() for TDX guest:
> 
> 	kvm->max_vcpus = min(kvm->max_vcpus, tdx_info->max_vcpus_per_td);
> 
> AFAICT no other change is needed.
> 
> And in KVM_TDX_VM_INIT (where TDH.MNG.INIT is done) we can just use kvm-
> >max_vcpus to fill the "MAX_VCPUS" in TD_PARAMS.
Huang, Kai May 31, 2024, 12:36 a.m. UTC | #19
On Thu, 2024-05-30 at 16:12 -0700, Sean Christopherson wrote:
> On Thu, May 30, 2024, Kai Huang wrote:
> > On Wed, 2024-05-29 at 16:15 -0700, Sean Christopherson wrote:
> > > In the unlikely event there is a legitimate reason for max_vcpus_per_td being
> > > less than KVM's minimum, then we can update KVM's minimum as needed.  But AFAICT,
> > > that's purely theoretical at this point, i.e. this is all much ado about nothing.
> > 
> > I am afraid we already have a legitimate case: TD partitioning.  Isaku
> > told me the 'max_vcpus_per_td' is lowed to 512 for the modules with TD
> > partitioning supported.  And again this is static, i.e., doesn't require
> > TD partitioning to be opt-in to low to 512.
> 
> So what's Intel's plan for use cases that creates TDs with >512 vCPUs?

I don't think we have such use cases.  Let me double check with TDX module
guys.  


> 
> > So AFAICT this isn't a theoretical thing now.
> > 
> > Also, I want to say I was wrong about "MAX_VCPUS" in the TD_PARAMS is part
> > of attestation.  It is not.  TDREPORT dosen't include the "MAX_VCPUS", and
> > it is not involved in the calculation of the measurement of the guest.
> > 
> > Given "MAX_VCPUS" is not part of attestation, I think there's no need to
> > allow user to change kvm->max_vcpus by enabling KVM_ENABLE_CAP ioctl() for
> > KVM_CAP_MAX_VCPUS.
> 
> Sure, but KVM would still need to advertise the reduced value for KVM_CAP_MAX_VCPUS
> when queried via KVM_CHECK_EXTENSION.  And userspace needs to be conditioned to
> do a VM-scoped check, not a system-scoped check.

Oh yes.
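
As a footnote to the VM-scoped check mentioned above, a minimal userspace
sketch (illustrative helper name; KVM_CHECK_EXTENSION issued on the VM fd
rather than on the /dev/kvm fd picks up the per-VM value once KVM reports
kvm->max_vcpus there):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative: query the per-VM vCPU limit on the VM fd, falling back
 * to the system-wide value from the /dev/kvm fd. */
static int query_max_vcpus(int kvm_fd, int vm_fd)
{
	int r = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);

	if (r <= 0)
		r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);

	return r;
}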

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 00b371d9a1ca..1b0dacc6b6c0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -21,6 +21,8 @@  KVM_X86_OP(hardware_unsetup)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
+KVM_X86_OP_OPTIONAL(max_vcpus);
+KVM_X86_OP_OPTIONAL(vm_enable_cap)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_OPTIONAL(vm_destroy)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37cda8aa07b6..cf8eb46b3a20 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1604,7 +1604,9 @@  struct kvm_x86_ops {
 	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
 	bool (*is_vm_type_supported)(unsigned long vm_type);
+	int (*max_vcpus)(struct kvm *kvm);
 	unsigned int vm_size;
+	int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 082e82ce6580..e8a1a7533eea 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,7 @@ 
 #include "nested.h"
 #include "pmu.h"
 #include "tdx.h"
+#include "tdx_arch.h"
 
 static bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
@@ -16,6 +17,17 @@  static bool vt_is_vm_type_supported(unsigned long type)
 		(enable_tdx && tdx_is_vm_type_supported(type));
 }
 
+static int vt_max_vcpus(struct kvm *kvm)
+{
+	if (!kvm)
+		return KVM_MAX_VCPUS;
+
+	if (is_td(kvm))
+		return min(kvm->max_vcpus, TDX_MAX_VCPUS);
+
+	return kvm->max_vcpus;
+}
+
 static __init int vt_hardware_setup(void)
 {
 	int ret;
@@ -39,6 +51,14 @@  static void vt_hardware_unsetup(void)
 	vmx_hardware_unsetup();
 }
 
+static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+	if (is_td(kvm))
+		return tdx_vm_enable_cap(kvm, cap);
+
+	return -EINVAL;
+}
+
 static int vt_vm_init(struct kvm *kvm)
 {
 	if (is_td(kvm))
@@ -77,7 +97,9 @@  struct kvm_x86_ops vt_x86_ops __initdata = {
 	.has_emulated_msr = vmx_has_emulated_msr,
 
 	.is_vm_type_supported = vt_is_vm_type_supported,
+	.max_vcpus = vt_max_vcpus,
 	.vm_size = sizeof(struct kvm_vmx),
+	.vm_enable_cap = vt_vm_enable_cap,
 	.vm_init = vt_vm_init,
 	.vm_destroy = vmx_vm_destroy,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 816ccdb4bc41..ee015f3ce2c9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -56,6 +56,35 @@  struct tdx_info {
 /* Info about the TDX module. */
 static struct tdx_info *tdx_info;
 
+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+	int r;
+
+	switch (cap->cap) {
+	case KVM_CAP_MAX_VCPUS: {
+		if (cap->flags || cap->args[0] == 0)
+			return -EINVAL;
+		if (cap->args[0] > KVM_MAX_VCPUS ||
+		    cap->args[0] > TDX_MAX_VCPUS)
+			return -E2BIG;
+
+		mutex_lock(&kvm->lock);
+		if (kvm->created_vcpus)
+			r = -EBUSY;
+		else {
+			kvm->max_vcpus = cap->args[0];
+			r = 0;
+		}
+		mutex_unlock(&kvm->lock);
+		break;
+	}
+	default:
+		r = -EINVAL;
+		break;
+	}
+	return r;
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f6c57ad44f80..0031a8d61589 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -139,12 +139,17 @@  int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void tdx_hardware_unsetup(void);
 bool tdx_is_vm_type_supported(unsigned long type);
 
+int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
 static inline void tdx_hardware_unsetup(void) {}
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 
+static inline int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+	return -EINVAL;
+};
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c459a5e9e520..3f16e9450d2f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4726,6 +4726,8 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 	case KVM_CAP_MAX_VCPUS:
 		r = KVM_MAX_VCPUS;
+		if (kvm_x86_ops.max_vcpus)
+			r = static_call(kvm_x86_max_vcpus)(kvm);
 		break;
 	case KVM_CAP_MAX_VCPU_ID:
 		r = KVM_MAX_VCPU_IDS;
@@ -6709,6 +6711,8 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	default:
 		r = -EINVAL;
+		if (kvm_x86_ops.vm_enable_cap)
+			r = static_call(kvm_x86_vm_enable_cap)(kvm, cap);
 		break;
 	}
 	return r;