diff mbox series

[v8,003/103] KVM: Refactor CPU compatibility check on module initialization

Message ID 4092a37d18f377003c6aebd9ced1280b0536c529.1659854790.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Headers show
Series KVM TDX basic feature support | expand

Commit Message

Isaku Yamahata Aug. 7, 2022, 10 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX module requires its initialization.  It requires VMX to be enabled.
Although there are several options of when to initialize it, the choice is
the initialization time of the KVM kernel module.  There is no usable
arch-specific hook for the TDX module to utilize during the KVM kernel module
initialization.  The code doesn't enable/disable hardware (VMX in TDX case)
during the kernel module initialization.  Add a hook for enabling hardware,
arch-specific initialization, and disabling hardware during KVM kernel
module initialization to make a room for TDX module initialization.  The
current KVM enables hardware when the first VM is created and disables
hardware when the last VM is destroyed.  When no VM is running, hardware is
disabled.  To follow these semantics, the kernel module initialization needs
to disable hardware. Opportunistically refactor the code to enable/disable
hardware.

Add hadware_enable_all() and hardware_disable_all() to kvm_init() and
introduce a new arch-specific callback function,
kvm_arch_post_hardware_enable_setup, for arch to do arch-specific
initialization that requires hardware_enable_all().  Opportunistically,
move kvm_arch_check_processor_compat() to to hardware_enabled_nolock().
TDX module initialization code will go into
kvm_arch_post_hardware_enable_setup().

This patch reorders some function calls as below from (*) (**) (A) and (B)
to (A) (B) and (*).  Here (A) and (B) depends on (*), but not (**).  By
code inspection, only mips and VMX has the code of (*).  No other
arch has empty (*).  So refactor mips and VMX and eliminate the
necessity hook for (*) instead of adding an unused hook.

Before this patch:
- Arch module initialization
  - kvm_init()
    - kvm_arch_init()
    - kvm_arch_check_processor_compat() on each CPUs
  - post-arch-specific initialization -- (*): (A) and (B) depends on this
  - post-arch-specific initialization -- (**): no dependency to (A) and (B)

- When creating/deleting the first/last VM
   - kvm_arch_hardware_enable() on each CPUs -- (A)
   - kvm_arch_hardware_disable() on each CPUs -- (B)

After this patch:
- Arch module initialization
  - kvm_init()
    - kvm_arch_init()
    - arch-specific initialization -- (*)
    - kvm_arch_check_processor_compat() on each CPUs
    - kvm_arch_hardware_enable() on each CPUs -- (A)
    - kvm_arch_hardware_disable() on each CPUs -- (B)
  - post-arch-specific initialization  -- (**)

- When creating/deleting the first/last VM (no logic change)
   - kvm_arch_hardware_enable() on each CPUs -- (A)
   - kvm_arch_hardware_disable() on each CPUs -- (B)

Code inspection result:
As long as I inspected, I found only mips and VMX have non-empty (*) or
non-empty (A) or (B).
x86: tested on a real machine
mips: compile test only
powerpc, s390, arm, riscv: code inspection only

- arch/mips/kvm/mips.c
  module init function, kvm_mips_init(), does some initialization after
  kvm_init().  Compile test only.

- arch/x86/kvm/x86.c
  - uses vm_list which is statically initialized.
  - static_call(kvm_x86_hardware_enable)();
    - SVM: (*) and (**) are empty.
    - VMX: initialize percpu variable loaded_vmcss_on_cpu that VMXON uses.

- arch/powerpc/kvm/powerpc.c
  kvm_arch_hardware_enable/disable() are nop

- arch/s390/kvm/kvm-s390.c
  kvm_arch_hardware_enable/disable() are nop

- arch/arm64/kvm/arm.c
  module init function, arm_init(), calls only kvm_init().
  (*) and (**) are empty

- arch/riscv/kvm/main.c
  module init function, riscv_kvm_init(), calls only kvm_init().
  (*) and (**) are empty

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/mips/kvm/mips.c     | 16 +++++++++++-----
 arch/x86/kvm/vmx/vmx.c   | 23 +++++++++++++++++++----
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 31 ++++++++++++++++++++++++-------
 4 files changed, 55 insertions(+), 16 deletions(-)

Comments

Binbin Wu Aug. 9, 2022, 7:16 a.m. UTC | #1
On 2022/8/8 6:00, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> TDX module requires its initialization.  It requires VMX to be enabled.
> Although there are several options of when to initialize it, the choice is
> the initialization time of the KVM kernel module.  There is no usable
> arch-specific hook for the TDX module to utilize during the KVM kernel module
> initialization.  The code doesn't enable/disable hardware (VMX in TDX case)
> during the kernel module initialization.  Add a hook for enabling hardware,
> arch-specific initialization, and disabling hardware during KVM kernel
> module initialization to make a room for TDX module initialization.  The
> current KVM enables hardware when the first VM is created and disables
> hardware when the last VM is destroyed.  When no VM is running, hardware is
> disabled.  To follow these semantics, the kernel module initialization needs
> to disable hardware. Opportunistically refactor the code to enable/disable
> hardware.
>
> Add hadware_enable_all() and hardware_disable_all() to kvm_init() and
> introduce a new arch-specific callback function,
> kvm_arch_post_hardware_enable_setup, for arch to do arch-specific
> initialization that requires hardware_enable_all().  Opportunistically,
> move kvm_arch_check_processor_compat() to to hardware_enabled_nolock().
> TDX module initialization code will go into
> kvm_arch_post_hardware_enable_setup().
>
> This patch reorders some function calls as below from (*) (**) (A) and (B)
> to (A) (B) and (*).  Here (A) and (B) depends on (*), but not (**).  By
> code inspection, only mips and VMX has the code of (*).  No other
No other or other?


> arch has empty (*).  So refactor mips and VMX and eliminate the
> necessity hook for (*) instead of adding an unused hook.
>
> Before this patch:
> - Arch module initialization
>    - kvm_init()
>      - kvm_arch_init()
>      - kvm_arch_check_processor_compat() on each CPUs
>    - post-arch-specific initialization -- (*): (A) and (B) depends on this
>    - post-arch-specific initialization -- (**): no dependency to (A) and (B)
>
> - When creating/deleting the first/last VM
>     - kvm_arch_hardware_enable() on each CPUs -- (A)
>     - kvm_arch_hardware_disable() on each CPUs -- (B)
>
> After this patch:
> - Arch module initialization
>    - kvm_init()
>      - kvm_arch_init()
>      - arch-specific initialization -- (*)
>      - kvm_arch_check_processor_compat() on each CPUs
>      - kvm_arch_hardware_enable() on each CPUs -- (A)


Maybe also put the new added kvm_arch_post_hardware_enable_setup here?
After all, this is the purpose of this patch.


>      - kvm_arch_hardware_disable() on each CPUs -- (B)
>    - post-arch-specific initialization  -- (**)
>
> - When creating/deleting the first/last VM (no logic change)
>     - kvm_arch_hardware_enable() on each CPUs -- (A)
>     - kvm_arch_hardware_disable() on each CPUs -- (B)
>
Huang, Kai Aug. 11, 2022, 11:16 a.m. UTC | #2
First of all, I think the patch title can be improved.  "refactor CPU
compatibility check on module initialization" isn't the purpose of this patch. 
It is just a bonus.  The title should reflect the main purpose (or behaviour) of
this patch:

	KVM: Temporarily enable hardware on all cpus during module loading time

On Sun, 2022-08-07 at 15:00 -0700, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX module requires its initialization.  It requires VMX to be enabled.
> Although there are several options of when to initialize it, the choice is
> the initialization time of the KVM kernel module.  There is no usable
> arch-specific hook for the TDX module to utilize during the KVM kernel module
> initialization.  The code doesn't enable/disable hardware (VMX in TDX case)
> during the kernel module initialization.  Add a hook for enabling hardware,
> arch-specific initialization, and disabling hardware during KVM kernel
> module initialization to make a room for TDX module initialization.  
> 

The hook is only for arch-specific initialization.  It doesn't do
hardware_enable_all() and hardware_disable_all().

> The
> current KVM enables hardware when the first VM is created and disables
> hardware when the last VM is destroyed.  When no VM is running, hardware is
> disabled.  To follow these semantics, the kernel module initialization needs
> to disable hardware. Opportunistically refactor the code to enable/disable
> hardware.
> 
> Add hadware_enable_all() and hardware_disable_all() to kvm_init() and
> introduce a new arch-specific callback function,
> kvm_arch_post_hardware_enable_setup, for arch to do arch-specific
> initialization that requires hardware_enable_all().  Opportunistically,
> move kvm_arch_check_processor_compat() to to hardware_enabled_nolock().
> TDX module initialization code will go into
> kvm_arch_post_hardware_enable_setup().

The remaining paragraphs are very hard to follow.  IMHO they are too details and
are not changelog material.  If you can obtain Reviewed-by from
reviewer/maintainer for specific arch, then it implies the code is fine to that
arch, and you don't need those details below in changelog.

So perhaps just say something like:

With doing hardware_enable_all() and hardware_disable_all() in kvm_init(), some
arch-specific code which must be done before them must also be moved before
kvm_init().  Only VMX and MIPS have such code:
- For VMX, per-cpu variable loaded_vmcss_on_cpu must be initialized before
hardware_disable_all().  Introduce vmx_early_init() to do it and call
vmx_eraly_init() before kvm_init() in vmx_init().
- For MIPS, move kvm_init() to the end of kvm_mips_init() (or whatever you
like...).


> 
> This patch reorders some function calls as below from (*) (**) (A) and (B)
> to (A) (B) and (*).  Here (A) and (B) depends on (*), but not (**).  By
> code inspection, only mips and VMX has the code of (*).  No other
> arch has empty (*).  So refactor mips and VMX and eliminate the
> necessity hook for (*) instead of adding an unused hook.
> 
> Before this patch:
> - Arch module initialization
>   - kvm_init()
>     - kvm_arch_init()
>     - kvm_arch_check_processor_compat() on each CPUs
>   - post-arch-specific initialization -- (*): (A) and (B) depends on this
>   - post-arch-specific initialization -- (**): no dependency to (A) and (B)
> 
> - When creating/deleting the first/last VM
>    - kvm_arch_hardware_enable() on each CPUs -- (A)
>    - kvm_arch_hardware_disable() on each CPUs -- (B)
> 
> After this patch:
> - Arch module initialization
>   - kvm_init()
>     - kvm_arch_init()
>     - arch-specific initialization -- (*)
>     - kvm_arch_check_processor_compat() on each CPUs
>     - kvm_arch_hardware_enable() on each CPUs -- (A)
>     - kvm_arch_hardware_disable() on each CPUs -- (B)
>   - post-arch-specific initialization  -- (**)
> 
> - When creating/deleting the first/last VM (no logic change)
>    - kvm_arch_hardware_enable() on each CPUs -- (A)
>    - kvm_arch_hardware_disable() on each CPUs -- (B)
> 
> Code inspection result:
> As long as I inspected, I found only mips and VMX have non-empty (*) or
> non-empty (A) or (B).
> x86: tested on a real machine
> mips: compile test only
> powerpc, s390, arm, riscv: code inspection only
> 
> - arch/mips/kvm/mips.c
>   module init function, kvm_mips_init(), does some initialization after
>   kvm_init().  Compile test only.
> 
> - arch/x86/kvm/x86.c
>   - uses vm_list which is statically initialized.
>   - static_call(kvm_x86_hardware_enable)();
>     - SVM: (*) and (**) are empty.
>     - VMX: initialize percpu variable loaded_vmcss_on_cpu that VMXON uses.
> 
> - arch/powerpc/kvm/powerpc.c
>   kvm_arch_hardware_enable/disable() are nop
> 
> - arch/s390/kvm/kvm-s390.c
>   kvm_arch_hardware_enable/disable() are nop
> 
> - arch/arm64/kvm/arm.c
>   module init function, arm_init(), calls only kvm_init().
>   (*) and (**) are empty
> 
> - arch/riscv/kvm/main.c
>   module init function, riscv_kvm_init(), calls only kvm_init().
>   (*) and (**) are empty
> 
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/mips/kvm/mips.c     | 16 +++++++++++-----
>  arch/x86/kvm/vmx/vmx.c   | 23 +++++++++++++++++++----
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/kvm_main.c      | 31 ++++++++++++++++++++++++-------
>  4 files changed, 55 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index 092d09fb6a7e..fd7339cff57c 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -1642,12 +1642,11 @@ static int __init kvm_mips_init(void)
>  		return -EOPNOTSUPP;
>  	}
>  
> +	/*
> +	 * kvm_init() calls kvm_arch_hardware_enable/disable().  The early
> +	 * initialization is needed before calling kvm_init().
> +	 */

Sorry I take this back.  It's upto MIPS reviewer/maintainer to decide whether we
need a comment here.

>  	ret = kvm_mips_entry_setup();
> -	if (ret)
> -		return ret;
> -
> -	ret = kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
> -
>  	if (ret)
>  		return ret;
>  
> @@ -1656,6 +1655,13 @@ static int __init kvm_mips_init(void)
>  
>  	register_die_notifier(&kvm_mips_csr_die_notifier);
>  
> +	ret = kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
> +
> +	if (ret) {
> +		unregister_die_notifier(&kvm_mips_csr_die_notifier);
> +		return ret;
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c5637a4c1c43..4d7e1073984d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8433,6 +8433,23 @@ static void vmx_exit(void)
>  }
>  module_exit(vmx_exit);
>  
> +/*
> + * Early initialization before kvm_init() so that vmx_hardware_enable/disable()
> + * can work.
> + */
> +static void __init vmx_init_early(void)
> +{
> +	int cpu;
> +
> +	/*
> +	 * vmx_hardware_disable() accesses loaded_vmcss_on_cpu list.
> +	 * Initialize the variable before kvm_init() that calls
> +	 * vmx_hardware_enable/disable().
> +	 */
> +	for_each_possible_cpu(cpu)
> +		INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> +}
> +
>  static int __init vmx_init(void)
>  {
>  	int r, cpu;
> @@ -8470,6 +8487,7 @@ static int __init vmx_init(void)
>  	}
>  #endif
>  
> +	vmx_init_early();
>  	r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
>  		     __alignof__(struct vcpu_vmx), THIS_MODULE);
>  	if (r)
> @@ -8490,11 +8508,8 @@ static int __init vmx_init(void)
>  
>  	vmx_setup_fb_clear_ctrl();
>  
> -	for_each_possible_cpu(cpu) {
> -		INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> -
> +	for_each_possible_cpu(cpu)
>  		pi_init_cpu(cpu);
> -	}
>  
>  #ifdef CONFIG_KEXEC_CORE
>  	rcu_assign_pointer(crash_vmclear_loaded_vmcss,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9643b8eadefe..63d4e3ce3e53 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1437,6 +1437,7 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
>  int kvm_arch_hardware_enable(void);
>  void kvm_arch_hardware_disable(void);
>  int kvm_arch_hardware_setup(void *opaque);
> +int kvm_arch_post_hardware_enable_setup(void *opaque);
>  void kvm_arch_hardware_unsetup(void);
>  int kvm_arch_check_processor_compat(void);
>  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0f5767e5ae45..a43fdbfaede2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4995,8 +4995,13 @@ static void hardware_enable_nolock(void *junk)
>  
>  	cpumask_set_cpu(cpu, cpus_hardware_enabled);
>  
> +	r = kvm_arch_check_processor_compat();
> +	if (r)
> +		goto out;
> +
>  	r = kvm_arch_hardware_enable();
>  
> +out:
>  	if (r) {
>  		cpumask_clear_cpu(cpu, cpus_hardware_enabled);
>  		atomic_inc(&hardware_enable_failed);
> @@ -5793,9 +5798,9 @@ void kvm_unregister_perf_callbacks(void)
>  }
>  #endif
>  
> -static void check_processor_compat(void *rtn)
> +__weak int kvm_arch_post_hardware_enable_setup(void *opaque)
>  {
> -	*(int *)rtn = kvm_arch_check_processor_compat();
> +	return 0;
>  }
>  
>  int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
> @@ -5828,11 +5833,23 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
>  	if (r < 0)
>  		goto out_free_1;
>  
> -	for_each_online_cpu(cpu) {
> -		smp_call_function_single(cpu, check_processor_compat, &r, 1);
> -		if (r < 0)
> -			goto out_free_2;
> -	}
> +	/* hardware_enable_nolock() checks CPU compatibility on each CPUs. */
> +	r = hardware_enable_all();
> +	if (r)
> +		goto out_free_2;
> +	/*
> +	 * Arch specific initialization that requires to enable virtualization
> +	 * feature.  e.g. TDX module initialization requires VMXON on all
> +	 * present CPUs.
> +	 */
> +	kvm_arch_post_hardware_enable_setup(opaque);
> +	/*
> +	 * Make hardware disabled after the KVM module initialization.  KVM
> +	 * enables hardware when the first KVM VM is created and disables
> +	 * hardware when the last KVM VM is destroyed.  When no KVM VM is
> +	 * running, hardware is disabled.  Keep that semantics.
> +	 */

Except the first sentence, the remaining sentences are more like changelog
material.  Perhaps just say something below to be more specific on the purpose:

	/*
	 * Disable hardware on all cpus so that out-of-tree drivers which
	 * also use hardware-assisted virtualization (such as virtualbox
	 * kernel module) can still be loaded when KVM is loaded.
	 */

> +	hardware_disable_all();
>  
>  	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
>  				      kvm_starting_cpu, kvm_dying_cpu);


Anyway (although I still think it's better to introduce the __weak
kvm_arch_post_hardware_enable_setup() in later patch together with the VMX
version to support TDX), with above comments addressed:

Acked-by: Kai Huang <kai.huang@intel.com>
Sean Christopherson Aug. 11, 2022, 5:39 p.m. UTC | #3
+Will (for arm crud)

On Thu, Aug 11, 2022, Huang, Kai wrote:
> First of all, I think the patch title can be improved.  "refactor CPU
> compatibility check on module initialization" isn't the purpose of this patch. 
> It is just a bonus.  The title should reflect the main purpose (or behaviour) of
> this patch:
> 
> 	KVM: Temporarily enable hardware on all cpus during module loading time

...

> > +	/* hardware_enable_nolock() checks CPU compatibility on each CPUs. */
> > +	r = hardware_enable_all();
> > +	if (r)
> > +		goto out_free_2;
> > +	/*
> > +	 * Arch specific initialization that requires to enable virtualization
> > +	 * feature.  e.g. TDX module initialization requires VMXON on all
> > +	 * present CPUs.
> > +	 */
> > +	kvm_arch_post_hardware_enable_setup(opaque);
> > +	/*
> > +	 * Make hardware disabled after the KVM module initialization.  KVM
> > +	 * enables hardware when the first KVM VM is created and disables
> > +	 * hardware when the last KVM VM is destroyed.  When no KVM VM is
> > +	 * running, hardware is disabled.  Keep that semantics.
> > +	 */
> 
> Except the first sentence, the remaining sentences are more like changelog
> material.  Perhaps just say something below to be more specific on the purpose:
> 
> 	/*
> 	 * Disable hardware on all cpus so that out-of-tree drivers which
> 	 * also use hardware-assisted virtualization (such as virtualbox
> 	 * kernel module) can still be loaded when KVM is loaded.
> 	 */
> 
> > +	hardware_disable_all();
> >  
> >  	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
> >  				      kvm_starting_cpu, kvm_dying_cpu);

I've been poking at the "hardware enable" code this week for other reasons, and
have come to the conclusion that the current implementation is a mess.

x86 overloads "hardware enable" to do three different things:

  1. actually enable hardware
  2. snapshot per-CPU MSR value for user-return MSRs
  3. handle unstable TSC _for existing VMs_ on suspend+resume and/or CPU hotplug

#2 and #3 have nothing to do with enabling hardware, kvm_arch_hardware_enable() just
so happens to be called in a superset of what is needed for dealing with unstable TSCs,
and AFAICT the user-return MSRs is simply a historical wart.  The user-return MSRs
code is subtly very, very nasty, as it means that KVM snaphots MSRs from IRQ context,
e.g. if an out-of-tree module is running VMs, the IRQ can interrupt the _guest_ and
cause KVM to snapshot guest registers.  VMX and SVM kinda sorta guard against this
by refusing to load if VMX/SVM are already enabled, but it's not foolproof.

Eww, and #3 is broken.  If CPU (un)hotplug collides with kvm_destroy_vm() or
kvm_create_vm(), kvm_arch_hardware_enable() could explode due to vm_list being
modified while it's being walked.

Of course, that path is broken for other reasons too, e.g. needs to prevent CPUs
from going on/off-line when KVM is enabling hardware.
https://lore.kernel.org/all/20220216031528.92558-7-chao.gao@intel.com

arm64 is also quite evil and circumvents KVM's hardware enabling logic to some extent.
kvm_arch_init() => init_subsystems() unconditionally enables hardware, and for pKVM
_leaves_ hardware enabled.  And then hyp_init_cpu_pm_notifier() disables/enables
hardware across lower power enter+exit, except if pKVM is enabled.  The icing on
the cake is "disabling" hardware doesn't even do anything (AFAICT) if the kernel is
running at EL2 (which I think is nVHE + not-pKVM?).

PPC apparently didn't want to be left out of the party, and despite having a nop
for kvm_arch_hardware_disable(), it does its own "is KVM enabled" tracking (see
kvm_hv_vm_(de)activated()).  At least PPC gets the cpus_read_(un)lock() stuff right...

MIPS doesn't appear to have any shenanigans, but kvm_vz_hardware_enable() appears
to be a "heavy" operation, i.e. ideally not something that should be done spuriously.

s390 and PPC are the only sane architectures and don't require explicit enabling
of virtualization.

At a glance, arm64 won't explode, but enabling hardware _twice_ during kvm_init()
is all kinds of gross.

Another wart that we can clean up is the cpus_hardware_enabled mask.  I don't see
any reason KVM needs to use a global mask, a per-cpu variable a la kvm_arm_hardware_enabled
would do just fine.

OMG, and there's another bug lurking (I need to stop looking at this code).  Commit
5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") added an error
path that can cause VM creation to fail _after_ it has been added to the list, but
doesn't unwind _any_ of the stuff done by kvm_arch_post_init_vm() and beyond.

Rather than trying to rework common KVM to fit all the architectures random needs,
I think we should instead overhaul the entire mess.  And we should do that ASAP
ahead of TDX, though obviously with an eye toward not sucking for TDX.

Not 100% thought out at this point, but I think we can do:

  1.  Have x86 snapshot per-CPU user-return MRS on first use (trivial to do by adding
      a flag to struct kvm_user_return_msrs, as user_return_msrs is already per-CPU).

  2.  Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock and
      cpu_read_lock().

  3.  Provide arch hooks that are invoked for "power management" operations (including
      CPU hotplug and host reboot, hence the quotes).  Note, there's both a platform-
      wide PM notifier and a per-CPU notifier...

  4.  Rename kvm_arch_post_init_vm() to e.g. kvm_arch_add_vm(), call it under
      kvm_lock, and pass in kvm_usage_count.

  5a. Drop cpus_hardware_enabled and drop the common hardware enable/disable code.

 or 

  5b. Expose kvm_hardware_enable_all() and/or kvm_hardware_enable() so that archs
      don't need to implement their own error handling and per-CPU flags.

I.e. give each architecture hooks to handle possible transition points, but otherwise
let arch code decide when and how to do hardware enabling/disabling. 

I'm very tempted to vote for (5a); x86 is the only architecture has an error path
in kvm_arch_hardware_enable(), and trying to get common code to play nice with arm's
kvm_arm_hardware_enabled logic is probably going to be weird.

E.g. if we can get the back half kvm_create_vm() to look like the below, then arch
code can enable hardware during kvm_arch_add_vm() if the existing count is zero
without generic KVM needing to worry about when hardware needs to be enabled and
disabled.

	r = kvm_arch_init_vm(kvm, type);
	if (r)
		goto out_err_no_arch_destroy_vm;

	r = kvm_init_mmu_notifier(kvm);
	if (r)
		goto out_err_no_mmu_notifier;

	/*
	 * When the fd passed to this ioctl() is opened it pins the module,
	 * but try_module_get() also prevents getting a reference if the module
	 * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
	 */
	if (!try_module_get(kvm_chardev_ops.owner)) {
		r = -ENODEV;
		goto out_err;
	}

	mutex_lock(&kvm_lock);
	cpus_read_lock();
	r = kvm_arch_add_vm(kvm, kvm_usage_count);
	if (r)
		goto out_final;
	kvm_usage_count++;
	list_add(&kvm->vm_list, &vm_list);
	cpus_read_unlock();
	mutex_unlock(&kvm_lock);

	if (r)
		goto out_put_module;

	preempt_notifier_inc();
	kvm_init_pm_notifier(kvm);

	return kvm;

out_final:
	cpus_read_unlock();
	mutex_unlock(&kvm_lock);
	module_put(kvm_chardev_ops.owner);
out_err_no_put_module:
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
	if (kvm->mmu_notifier.ops)
		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
#endif
out_err_no_mmu_notifier:
	kvm_arch_destroy_vm(kvm);
Huang, Kai Aug. 12, 2022, 11:35 a.m. UTC | #4
On Thu, 2022-08-11 at 17:39 +0000, Sean Christopherson wrote:
> +Will (for arm crud)
> 
> On Thu, Aug 11, 2022, Huang, Kai wrote:
> > First of all, I think the patch title can be improved.  "refactor CPU
> > compatibility check on module initialization" isn't the purpose of this patch. 
> > It is just a bonus.  The title should reflect the main purpose (or behaviour) of
> > this patch:
> > 
> > 	KVM: Temporarily enable hardware on all cpus during module loading time
> 
> ...
> 
> > > +	/* hardware_enable_nolock() checks CPU compatibility on each CPUs. */
> > > +	r = hardware_enable_all();
> > > +	if (r)
> > > +		goto out_free_2;
> > > +	/*
> > > +	 * Arch specific initialization that requires to enable virtualization
> > > +	 * feature.  e.g. TDX module initialization requires VMXON on all
> > > +	 * present CPUs.
> > > +	 */
> > > +	kvm_arch_post_hardware_enable_setup(opaque);
> > > +	/*
> > > +	 * Make hardware disabled after the KVM module initialization.  KVM
> > > +	 * enables hardware when the first KVM VM is created and disables
> > > +	 * hardware when the last KVM VM is destroyed.  When no KVM VM is
> > > +	 * running, hardware is disabled.  Keep that semantics.
> > > +	 */
> > 
> > Except the first sentence, the remaining sentences are more like changelog
> > material.  Perhaps just say something below to be more specific on the purpose:
> > 
> > 	/*
> > 	 * Disable hardware on all cpus so that out-of-tree drivers which
> > 	 * also use hardware-assisted virtualization (such as virtualbox
> > 	 * kernel module) can still be loaded when KVM is loaded.
> > 	 */
> > 
> > > +	hardware_disable_all();
> > >  
> > >  	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
> > >  				      kvm_starting_cpu, kvm_dying_cpu);
> 
> I've been poking at the "hardware enable" code this week for other reasons, and
> have come to the conclusion that the current implementation is a mess.

Thanks for the lengthy reply :)

First of all, to clarify, I guess by "current implementation" you mean the
current upstream KVM code, but not this particular patch? :)

> 
> x86 overloads "hardware enable" to do three different things:
> 
>   1. actually enable hardware
>   2. snapshot per-CPU MSR value for user-return MSRs
>   3. handle unstable TSC _for existing VMs_ on suspend+resume and/or CPU hotplug
> 
> #2 and #3 have nothing to do with enabling hardware, kvm_arch_hardware_enable() just
> so happens to be called in a superset of what is needed for dealing with unstable TSCs,
> and AFAICT the user-return MSRs is simply a historical wart.  The user-return MSRs
> code is subtly very, very nasty, as it means that KVM snaphots MSRs from IRQ context,
> e.g. if an out-of-tree module is running VMs, the IRQ can interrupt the _guest_ and
> cause KVM to snapshot guest registers.  VMX and SVM kinda sorta guard against this
> by refusing to load if VMX/SVM are already enabled, but it's not foolproof.
> 
> Eww, and #3 is broken.  If CPU (un)hotplug collides with kvm_destroy_vm() or
> kvm_create_vm(), kvm_arch_hardware_enable() could explode due to vm_list being
> modified while it's being walked.

Agreed.

> 
> Of course, that path is broken for other reasons too, e.g. needs to prevent CPUs
> from going on/off-line when KVM is enabling hardware.
> https://lore.kernel.org/all/20220216031528.92558-7-chao.gao@intel.com

If I read correctly, the problem described in above link seems only to be true
after we move CPUHP_AP_KVM_STARTING from STARTING section to ONLINE section, but
this hasn't been done yet in the current upstream KVM.  Currently,
CPUHP_AP_KVM_STARTING is still in STARTING section so it is guaranteed it has
been executed before start_secondary sets itself to online cpu mask. 

Btw I saw v4 of Chao's patchset was sent Feb this year.  It seems that series
indeed improved CPU compatibility check and hotplug handling.  Any reason that
series wasn't merged?

> 
> arm64 is also quite evil and circumvents KVM's hardware enabling logic to some extent.
> kvm_arch_init() => init_subsystems() unconditionally enables hardware, and for pKVM
> _leaves_ hardware enabled.  And then hyp_init_cpu_pm_notifier() disables/enables
> hardware across lower power enter+exit, except if pKVM is enabled.  The icing on
> the cake is "disabling" hardware doesn't even do anything (AFAICT) if the kernel is
> running at EL2 (which I think is nVHE + not-pKVM?).
> 
> PPC apparently didn't want to be left out of the party, and despite having a nop
> for kvm_arch_hardware_disable(), it does its own "is KVM enabled" tracking (see
> kvm_hv_vm_(de)activated()).  At least PPC gets the cpus_read_(un)lock() stuff right...
> 
> MIPS doesn't appear to have any shenanigans, but kvm_vz_hardware_enable() appears
> to be a "heavy" operation, i.e. ideally not something that should be done spuriously.
> 
> s390 and PPC are the only sane architectures and don't require explicit enabling
> of virtualization.
> 
> At a glance, arm64 won't explode, but enabling hardware _twice_ during kvm_init()
> is all kinds of gross.
> 
> Another wart that we can clean up is the cpus_hardware_enabled mask.  I don't see
> any reason KVM needs to use a global mask, a per-cpu variable a la kvm_arm_hardware_enabled
> would do just fine.
> 
> OMG, and there's another bug lurking (I need to stop looking at this code).  Commit
> 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") added an error
> path that can cause VM creation to fail _after_ it has been added to the list, but
> doesn't unwind _any_ of the stuff done by kvm_arch_post_init_vm() and beyond.
> 
> Rather than trying to rework common KVM to fit all the architectures random needs,
> I think we should instead overhaul the entire mess.  And we should do that ASAP
> ahead of TDX, though obviously with an eye toward not sucking for TDX.
> 
> Not 100% thought out at this point, but I think we can do:
> 
>   1.  Have x86 snapshot per-CPU user-return MRS on first use (trivial to do by adding
>       a flag to struct kvm_user_return_msrs, as user_return_msrs is already per-CPU).
> 
>   2.  Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock and
>       cpu_read_lock().

Agreed spinlock should/can be removed/replaced.  It seems there's no need to use
spinlock.

Also agreed that kvm_lock should be used.  But I am not sure whether
cpus_read_lock() is needed (whether CPU hotplug should be prevented).  In
current KVM, we don't do CPU compatibility check for hotplug CPU anyway, so when
KVM does CPU compatibility check using for_each_online_cpu(), if CPU hotplug
(hot-removal) happens, the worst case is we lose compatibility check on that
CPU.

Or perhaps I am missing something?

> 
>   3.  Provide arch hooks that are invoked for "power management" operations (including
>       CPU hotplug and host reboot, hence the quotes).  Note, there's both a platform-
>       wide PM notifier and a per-CPU notifier...
> 
>   4.  Rename kvm_arch_post_init_vm() to e.g. kvm_arch_add_vm(), call it under
>       kvm_lock, and pass in kvm_usage_count.
> 
>   5a. Drop cpus_hardware_enabled and drop the common hardware enable/disable code.
> 
>  or 
> 
>   5b. Expose kvm_hardware_enable_all() and/or kvm_hardware_enable() so that archs
>       don't need to implement their own error handling and per-CPU flags.
> 
> I.e. give each architecture hooks to handle possible transition points, but otherwise
> let arch code decide when and how to do hardware enabling/disabling. 
> 
> I'm very tempted to vote for (5a); x86 is the only architecture has an error path
> in kvm_arch_hardware_enable(), and trying to get common code to play nice with arm's
> kvm_arm_hardware_enabled logic is probably going to be weird.
> 
> E.g. if we can get the back half kvm_create_vm() to look like the below, then arch
> code can enable hardware during kvm_arch_add_vm() if the existing count is zero
> without generic KVM needing to worry about when hardware needs to be enabled and
> disabled.
> 
> 	r = kvm_arch_init_vm(kvm, type);
> 	if (r)
> 		goto out_err_no_arch_destroy_vm;
> 
> 	r = kvm_init_mmu_notifier(kvm);
> 	if (r)
> 		goto out_err_no_mmu_notifier;
> 
> 	/*
> 	 * When the fd passed to this ioctl() is opened it pins the module,
> 	 * but try_module_get() also prevents getting a reference if the module
> 	 * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
> 	 */
> 	if (!try_module_get(kvm_chardev_ops.owner)) {
> 		r = -ENODEV;
> 		goto out_err;
> 	}
> 
> 	mutex_lock(&kvm_lock);
> 	cpus_read_lock();
> 	r = kvm_arch_add_vm(kvm, kvm_usage_count);

Holding cpus_read_lock() here implies CPU hotplug cannot happen during
kvm_arch_add_vm().  This needs a justification/comment to explain why.  

Also, assuming we have a justification, since (based on your description above)
arch _may_ choose to enable hardware within it, but it is not a _must_.  So
maybe remove cpus_read_lock() here and let kvm_arch_add_vm() to decide whether
to use it?

> 	if (r)
> 		goto out_final;
> 	kvm_usage_count++;
> 	list_add(&kvm->vm_list, &vm_list);
> 	cpus_read_unlock();
> 	mutex_unlock(&kvm_lock);
> 
> 	if (r)
> 		goto out_put_module;
> 
> 	preempt_notifier_inc();
> 	kvm_init_pm_notifier(kvm);
> 
> 	return kvm;
> 
> out_final:
> 	cpus_read_unlock();
> 	mutex_unlock(&kvm_lock);
> 	module_put(kvm_chardev_ops.owner);
> out_err_no_put_module:
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> 	if (kvm->mmu_notifier.ops)
> 		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
> #endif
> out_err_no_mmu_notifier:
> 	kvm_arch_destroy_vm(kvm);
Sean Christopherson Aug. 15, 2022, 10:35 p.m. UTC | #5
On Fri, Aug 12, 2022, Huang, Kai wrote:
> On Thu, 2022-08-11 at 17:39 +0000, Sean Christopherson wrote:
> > I've been poking at the "hardware enable" code this week for other reasons, and
> > have come to the conclusion that the current implementation is a mess.
> 
> Thanks for the lengthy reply :)
> 
> First of all, to clarify, I guess by "current implementation" you mean the
> current upstream KVM code, but not this particular patch? :)

Yeah, upstream code.

> > Of course, that path is broken for other reasons too, e.g. needs to prevent CPUs
> > from going on/off-line when KVM is enabling hardware.
> > https://lore.kernel.org/all/20220216031528.92558-7-chao.gao@intel.com
> 
> If I read correctly, the problem described in above link seems only to be true
> after we move CPUHP_AP_KVM_STARTING from STARTING section to ONLINE section, but
> this hasn't been done yet in the current upstream KVM.  Currently,
> CPUHP_AP_KVM_STARTING is still in STARTING section so it is guaranteed it has
> been executed before start_secondary sets itself to online cpu mask. 

The lurking issue is that for_each_online_cpu() can against hotplug, i.e. every
instance of for_each_online_cpu() in KVM is buggy (at least on the x86 side, I
can't tell at a glance whether or not arm pKVM's usage is safe).

https://lore.kernel.org/all/87bl20aa72.ffs@tglx

> Btw I saw v4 of Chao's patchset was sent Feb this year.  It seems that series
> indeed improved CPU compatibility check and hotplug handling.  Any reason that
> series wasn't merged?

AFAIK it was just a lack of reviews/acks for the non-KVM patches.

> Also agreed that kvm_lock should be used.  But I am not sure whether
> cpus_read_lock() is needed (whether CPU hotplug should be prevented).  In
> current KVM, we don't do CPU compatibility check for hotplug CPU anyway, so when
> KVM does CPU compatibility check using for_each_online_cpu(), if CPU hotplug
> (hot-removal) happens, the worst case is we lose compatibility check on that
> CPU.
> 
> Or perhaps I am missing something?

On a hot-add of an incompatible CPU, KVM would potentially skip the compatibility
check and try to enable hardware on an incompatible/broken CPU.

Another possible bug is the checking of hv_get_vp_assist_page(); hot-adding a
CPU that failed to allocate the VP assist page while vmx_init() is checking online
CPUs could result in a NULL pointer deref due to KVM not rejecting the CPU as it
should.	

> > 	mutex_lock(&kvm_lock);
> > 	cpus_read_lock();
> > 	r = kvm_arch_add_vm(kvm, kvm_usage_count);
> 
> Holding cpus_read_lock() here implies CPU hotplug cannot happen during
> kvm_arch_add_vm().  This needs a justification/comment to explain why.  

Oh, for sure, the above was only intended to be a rough sketch, definitely not a
formal patch.

> Also, assuming we have a justification, since (based on your description above)
> arch _may_ choose to enable hardware within it, but it is not a _must_.  So
> maybe remove cpus_read_lock() here and let kvm_arch_add_vm() to decide whether
> to use it?

My thought is that it makes sense to provide a CPU hotplug friendly arch hook since
(a) all architectures except s390 (and maybe PPC) need such a hook (or more likely,
multiple hooks) and (b) cpus_read_lock() will almost never be contended.

In other words, address the problem that's easy to solve (but also easy to forget)
in generic code, but let architectures deal with the hardware enabling problem,
which is a mess to solve in generic code (unless I'm missing something).

All that said, I haven't tried compiling yet, let alone actually running anything,
so it's entirely possible my idea won't pan out.
Huang, Kai Aug. 15, 2022, 11:06 p.m. UTC | #6
On Mon, 2022-08-15 at 22:35 +0000, Sean Christopherson wrote:
> On Fri, Aug 12, 2022, Huang, Kai wrote:
> > On Thu, 2022-08-11 at 17:39 +0000, Sean Christopherson wrote:
> > > I've been poking at the "hardware enable" code this week for other reasons, and
> > > have come to the conclusion that the current implementation is a mess.
> > 
> > Thanks for the lengthy reply :)
> > 
> > First of all, to clarify, I guess by "current implementation" you mean the
> > current upstream KVM code, but not this particular patch? :)
> 
> Yeah, upstream code.
> 
> > > Of course, that path is broken for other reasons too, e.g. needs to prevent CPUs
> > > from going on/off-line when KVM is enabling hardware.
> > > https://lore.kernel.org/all/20220216031528.92558-7-chao.gao@intel.com
> > 
> > If I read correctly, the problem described in above link seems only to be true
> > after we move CPUHP_AP_KVM_STARTING from STARTING section to ONLINE section, but
> > this hasn't been done yet in the current upstream KVM.  Currently,
> > CPUHP_AP_KVM_STARTING is still in STARTING section so it is guaranteed it has
> > been executed before start_secondary sets itself to online cpu mask. 
> 
> The lurking issue is that for_each_online_cpu() can against hotplug, i.e. every
> instance of for_each_online_cpu() in KVM is buggy (at least on the x86 side, I
> can't tell at a glance whether or not arm pKVM's usage is safe).
> 
> https://lore.kernel.org/all/87bl20aa72.ffs@tglx

Yes agreed. for_each_online_cpu() can race with CPU hotplug.

But the fact is looks there are many places using for_each_online_cpus() w/o
holding cpus_read_lock().  :)

> 
> > Btw I saw v4 of Chao's patchset was sent Feb this year.  It seems that series
> > indeed improved CPU compatibility check and hotplug handling.  Any reason that
> > series wasn't merged?
> 
> AFAIK it was just a lack of reviews/acks for the non-KVM patches.
> 
> > Also agreed that kvm_lock should be used.  But I am not sure whether
> > cpus_read_lock() is needed (whether CPU hotplug should be prevented).  In
> > current KVM, we don't do CPU compatibility check for hotplug CPU anyway, so when
> > KVM does CPU compatibility check using for_each_online_cpu(), if CPU hotplug
> > (hot-removal) happens, the worst case is we lose compatibility check on that
> > CPU.
> > 
> > Or perhaps I am missing something?
> 
> On a hot-add of an incompatible CPU, KVM would potentially skip the compatibility
> check and try to enable hardware on an incompatible/broken CPU.

To resolve this, we need to do compatibility check before actually enabling
hardware on each cpu, as Chao's series did.  I don't see using cpus_read_lock()
alone can actually fix anything.

> 
> Another possible bug is the checking of hv_get_vp_assist_page(); hot-adding a
> CPU that failed to allocate the VP assist page while vmx_init() is checking online
> CPUs could result in a NULL pointer deref due to KVM not rejecting the CPU as it
> should.
> 

So we need Chao's series to fix those problems: 1) Do compatibility check before
actually enable the hardware for each cpu; 2) allow CPU hotplug to fail; 3) Hold
cpus_read_lock() when needed.
Isaku Yamahata Aug. 23, 2022, 5:27 a.m. UTC | #7
On Fri, Aug 12, 2022 at 11:35:29AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> >   3.  Provide arch hooks that are invoked for "power management" operations (including
> >       CPU hotplug and host reboot, hence the quotes).  Note, there's both a platform-
> >       wide PM notifier and a per-CPU notifier...
> > 
> >   4.  Rename kvm_arch_post_init_vm() to e.g. kvm_arch_add_vm(), call it under
> >       kvm_lock, and pass in kvm_usage_count.
> > 
> >   5a. Drop cpus_hardware_enabled and drop the common hardware enable/disable code.
> > 
> >  or 
> > 
> >   5b. Expose kvm_hardware_enable_all() and/or kvm_hardware_enable() so that archs
> >       don't need to implement their own error handling and per-CPU flags.
> > 
> > I.e. give each architecture hooks to handle possible transition points, but otherwise
> > let arch code decide when and how to do hardware enabling/disabling. 
> > 
> > I'm very tempted to vote for (5a); x86 is the only architecture has an error path
> > in kvm_arch_hardware_enable(), and trying to get common code to play nice with arm's
> > kvm_arm_hardware_enabled logic is probably going to be weird.
> > 

I ended up with (5a) with the following RFC patches.
https://lore.kernel.org/kvm/cover.1660974106.git.isaku.yamahata@intel.com/T/#m0239e7800b66174b49c5b1049462aad50293a994


> > E.g. if we can get the back half kvm_create_vm() to look like the below, then arch
> > code can enable hardware during kvm_arch_add_vm() if the existing count is zero
> > without generic KVM needing to worry about when hardware needs to be enabled and
> > disabled.
> > 
> > 	r = kvm_arch_init_vm(kvm, type);
> > 	if (r)
> > 		goto out_err_no_arch_destroy_vm;
> > 
> > 	r = kvm_init_mmu_notifier(kvm);
> > 	if (r)
> > 		goto out_err_no_mmu_notifier;
> > 
> > 	/*
> > 	 * When the fd passed to this ioctl() is opened it pins the module,
> > 	 * but try_module_get() also prevents getting a reference if the module
> > 	 * is in MODULE_STATE_GOING (e.g. if someone ran "rmmod --wait").
> > 	 */
> > 	if (!try_module_get(kvm_chardev_ops.owner)) {
> > 		r = -ENODEV;
> > 		goto out_err;
> > 	}
> > 
> > 	mutex_lock(&kvm_lock);
> > 	cpus_read_lock();
> > 	r = kvm_arch_add_vm(kvm, kvm_usage_count);
> 
> Holding cpus_read_lock() here implies CPU hotplug cannot happen during
> kvm_arch_add_vm().  This needs a justification/comment to explain why.  
> 
> Also, assuming we have a justification, since (based on your description above)
> arch _may_ choose to enable hardware within it, but it is not a _must_.  So
> maybe remove cpus_read_lock() here and let kvm_arch_add_vm() to decide whether
> to use it?

For now, I put locking outside of kvm_arch_{add, del}_vm().  But I haven't looked
at non-x86 arch.  Probably arm code has its preference for its implementation.
Marc Zyngier Sept. 1, 2022, 9:03 a.m. UTC | #8
Sean,

On Thu, 11 Aug 2022 18:39:53 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> +Will (for arm crud)

When it comes to KVM/arm64, I'd appreciate if you could Cc me.

> arm64 is also quite evil and circumvents KVM's hardware enabling
> logic to some extent.  kvm_arch_init() => init_subsystems()
> unconditionally enables hardware, and for pKVM _leaves_ hardware
> enabled.  And then hyp_init_cpu_pm_notifier() disables/enables
> hardware across lower power enter+exit, except if pKVM is enabled.
> The icing on the cake is "disabling" hardware doesn't even do
> anything (AFAICT) if the kernel is running at EL2 (which I think is
> nVHE + not-pKVM?).

In the cases where disabling doesn't do anything (which are the exact
opposite of the cases you describe), that's because there is
absolutely *nothing* to do:

- If VHE, the kernel is the bloody hypervisor: disable virtualisation,
  kill the kernel.

- if pKVM, the kernel is basically a guest, and has no business
  touching anything at all.

So much the 'evil' behaviour.

	M.
Sean Christopherson Sept. 1, 2022, 2:08 p.m. UTC | #9
On Thu, Sep 01, 2022, Marc Zyngier wrote:
> Sean,
> 
> On Thu, 11 Aug 2022 18:39:53 +0100,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > +Will (for arm crud)
> 
> When it comes to KVM/arm64, I'd appreciate if you could Cc me.

Sorry, will do.

> > arm64 is also quite evil and circumvents KVM's hardware enabling
> > logic to some extent.  kvm_arch_init() => init_subsystems()
> > unconditionally enables hardware, and for pKVM _leaves_ hardware
> > enabled.  And then hyp_init_cpu_pm_notifier() disables/enables
> > hardware across lower power enter+exit, except if pKVM is enabled.
> > The icing on the cake is "disabling" hardware doesn't even do
> > anything (AFAICT) if the kernel is running at EL2 (which I think is
> > nVHE + not-pKVM?).
> 
> In the cases where disabling doesn't do anything (which are the exact
> opposite of the cases you describe), that's because there is
> absolutely *nothing* to do:

Yes, I know.

> - If VHE, the kernel is the bloody hypervisor: disable virtualisation,
>   kill the kernel.
> 
> - if pKVM, the kernel is basically a guest, and has no business
>   touching anything at all.
> 
> So much the 'evil' behaviour.

The colorful language is tongue-in-cheek.

I get the impression that you feel I am attacking ARM.  That is very much not what
I intended.  If anything, I'm attacking x86 for forcing its quirks on everyone else.

What am trying to point out here is that ARM and other architectures are not
well-served by KVM's current hardware enabling/disabling infrastructure.  I am not
saying that ARM is broken and needs to be fixed, I am saying that KVM is broken and
needs to be fixed, and that ARM is a victim of KVM's x86-centric origins.
diff mbox series

Patch

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 092d09fb6a7e..fd7339cff57c 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1642,12 +1642,11 @@  static int __init kvm_mips_init(void)
 		return -EOPNOTSUPP;
 	}
 
+	/*
+	 * kvm_init() calls kvm_arch_hardware_enable/disable().  The early
+	 * initialization is needed before calling kvm_init().
+	 */
 	ret = kvm_mips_entry_setup();
-	if (ret)
-		return ret;
-
-	ret = kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
-
 	if (ret)
 		return ret;
 
@@ -1656,6 +1655,13 @@  static int __init kvm_mips_init(void)
 
 	register_die_notifier(&kvm_mips_csr_die_notifier);
 
+	ret = kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
+
+	if (ret) {
+		unregister_die_notifier(&kvm_mips_csr_die_notifier);
+		return ret;
+	}
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c5637a4c1c43..4d7e1073984d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8433,6 +8433,23 @@  static void vmx_exit(void)
 }
 module_exit(vmx_exit);
 
+/*
+ * Early initialization before kvm_init() so that vmx_hardware_enable/disable()
+ * can work.
+ */
+static void __init vmx_init_early(void)
+{
+	int cpu;
+
+	/*
+	 * vmx_hardware_disable() accesses loaded_vmcss_on_cpu list.
+	 * Initialize the variable before kvm_init() that calls
+	 * vmx_hardware_enable/disable().
+	 */
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+}
+
 static int __init vmx_init(void)
 {
 	int r, cpu;
@@ -8470,6 +8487,7 @@  static int __init vmx_init(void)
 	}
 #endif
 
+	vmx_init_early();
 	r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
 		     __alignof__(struct vcpu_vmx), THIS_MODULE);
 	if (r)
@@ -8490,11 +8508,8 @@  static int __init vmx_init(void)
 
 	vmx_setup_fb_clear_ctrl();
 
-	for_each_possible_cpu(cpu) {
-		INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
-
+	for_each_possible_cpu(cpu)
 		pi_init_cpu(cpu);
-	}
 
 #ifdef CONFIG_KEXEC_CORE
 	rcu_assign_pointer(crash_vmclear_loaded_vmcss,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9643b8eadefe..63d4e3ce3e53 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1437,6 +1437,7 @@  static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
 int kvm_arch_hardware_enable(void);
 void kvm_arch_hardware_disable(void);
 int kvm_arch_hardware_setup(void *opaque);
+int kvm_arch_post_hardware_enable_setup(void *opaque);
 void kvm_arch_hardware_unsetup(void);
 int kvm_arch_check_processor_compat(void);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f5767e5ae45..a43fdbfaede2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4995,8 +4995,13 @@  static void hardware_enable_nolock(void *junk)
 
 	cpumask_set_cpu(cpu, cpus_hardware_enabled);
 
+	r = kvm_arch_check_processor_compat();
+	if (r)
+		goto out;
+
 	r = kvm_arch_hardware_enable();
 
+out:
 	if (r) {
 		cpumask_clear_cpu(cpu, cpus_hardware_enabled);
 		atomic_inc(&hardware_enable_failed);
@@ -5793,9 +5798,9 @@  void kvm_unregister_perf_callbacks(void)
 }
 #endif
 
-static void check_processor_compat(void *rtn)
+__weak int kvm_arch_post_hardware_enable_setup(void *opaque)
 {
-	*(int *)rtn = kvm_arch_check_processor_compat();
+	return 0;
 }
 
 int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
@@ -5828,11 +5833,23 @@  int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	if (r < 0)
 		goto out_free_1;
 
-	for_each_online_cpu(cpu) {
-		smp_call_function_single(cpu, check_processor_compat, &r, 1);
-		if (r < 0)
-			goto out_free_2;
-	}
+	/* hardware_enable_nolock() checks CPU compatibility on each CPUs. */
+	r = hardware_enable_all();
+	if (r)
+		goto out_free_2;
+	/*
+	 * Arch specific initialization that requires to enable virtualization
+	 * feature.  e.g. TDX module initialization requires VMXON on all
+	 * present CPUs.
+	 */
+	kvm_arch_post_hardware_enable_setup(opaque);
+	/*
+	 * Make hardware disabled after the KVM module initialization.  KVM
+	 * enables hardware when the first KVM VM is created and disables
+	 * hardware when the last KVM VM is destroyed.  When no KVM VM is
+	 * running, hardware is disabled.  Keep that semantics.
+	 */
+	hardware_disable_all();
 
 	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
 				      kvm_starting_cpu, kvm_dying_cpu);