Message ID | 20250224235542.2562848-2-seanjc@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | KVM: x86: nVMX IRQ fix and VM teardown cleanups | expand |
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 58b82d6fd77c..045c61cc7e54 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -12890,11 +12890,11 @@ void kvm_arch_destroy_vm(struct kvm *kvm) > mutex_unlock(&kvm->slots_lock); > } > kvm_unload_vcpu_mmus(kvm); > + kvm_destroy_vcpus(kvm); > kvm_x86_call(vm_destroy)(kvm); > kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); > kvm_pic_destroy(kvm); > kvm_ioapic_destroy(kvm); > - kvm_destroy_vcpus(kvm); > kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); > kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); > kvm_mmu_uninit_vm(kvm); After this change, now the sequence is that 1. kvm_arch_pre_destroy_vm() 2. kvm_arch_destroy_vm() 2.1 kvm_destroy_vcpus() 2.2 .vm_destroy hook 2.3 kvm_mmu_uninit_vm() --> mirror root ref is 1 upon here. Zap the mirror root and reclaim SETP page table pages. 2.4 .vm_free hook Since TDX needs to reclaim the TDR page after reclaiming all other pages, we currently added a vm_free hook at 2.4, after 2.3. Could we move kvm_mmu_uninit_vm() before the .vm_destroy hook and after kvm_destroy_vcpus()? Or move the .vm_destroy hook after kvm_mmu_uninit_vm(), e.g. after kvm_page_track_cleanup()? Otherwise, TDX still needs to introduce the .vm_free hook, which is invoked at the end of kvm_arch_destroy_vm().
On Tue, Feb 25, 2025, Yan Zhao wrote: > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 58b82d6fd77c..045c61cc7e54 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -12890,11 +12890,11 @@ void kvm_arch_destroy_vm(struct kvm *kvm) > > mutex_unlock(&kvm->slots_lock); > > } > > kvm_unload_vcpu_mmus(kvm); > > + kvm_destroy_vcpus(kvm); > > kvm_x86_call(vm_destroy)(kvm); > > kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); > > kvm_pic_destroy(kvm); > > kvm_ioapic_destroy(kvm); > > - kvm_destroy_vcpus(kvm); > > kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); > > kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); > > kvm_mmu_uninit_vm(kvm); > After this change, now the sequence is that > > 1. kvm_arch_pre_destroy_vm() > 2. kvm_arch_destroy_vm() > 2.1 kvm_destroy_vcpus() > 2.2 .vm_destroy hook > 2.3 kvm_mmu_uninit_vm() --> mirror root ref is 1 upon here. Zap the mirror > root and reclaim SETP page table pages. > 2.4 .vm_free hook > > Since TDX needs to reclaim the TDR page after reclaiming all other pages, we > currently added a vm_free hook at 2.4, after 2.3. > > Could we move kvm_mmu_uninit_vm() before the .vm_destroy hook and after > kvm_destroy_vcpus()? > > Or move the .vm_destroy hook after kvm_mmu_uninit_vm(), e.g. after > kvm_page_track_cleanup()? I would go for the first option. I'll tack on a patch since I need to test all of these flows anyways, and I would much prefer to change course sooner rather than later if it doesn't work for whatever reason. Is this comment accurate? diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1e5f6f820c0b..f5685f153e08 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12874,13 +12874,19 @@ void kvm_arch_destroy_vm(struct kvm *kvm) mutex_unlock(&kvm->slots_lock); } kvm_destroy_vcpus(kvm); + + /* + * Do final MMU teardown prior to calling into vendor code. All pages + * that were donated to the TDX module, e.g. for S-EPT tables, need to + * be reclaimed before the VM metadata page can be freed. + */ + kvm_mmu_uninit_vm(kvm); kvm_x86_call(vm_destroy)(kvm); kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); kvm_pic_destroy(kvm); kvm_ioapic_destroy(kvm); kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); - kvm_mmu_uninit_vm(kvm); kvm_page_track_cleanup(kvm); kvm_xen_destroy_vm(kvm); kvm_hv_destroy_vm(kvm);
On 2/25/25 00:55, Sean Christopherson wrote: > Free vCPUs before freeing any VM state, as both SVM and VMX may access > VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs > to be kicked out of nested guest mode. > > Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was > called") partially fixed the issue, but for unknown reasons only moved the > MMU unloading before VM destruction. Complete the change, and free all > vCPU state prior to destroying VM state, as nVMX accesses even more state > than nSVM. I applied this to kvm-coco-queue, I will place it in kvm/master too unless you shout. Paolo > In addition to the AVIC, KVM can hit a use-after-free on MSR filters: > > kvm_msr_allowed+0x4c/0xd0 > __kvm_set_msr+0x12d/0x1e0 > kvm_set_msr+0x19/0x40 > load_vmcs12_host_state+0x2d8/0x6e0 [kvm_intel] > nested_vmx_vmexit+0x715/0xbd0 [kvm_intel] > nested_vmx_free_vcpu+0x33/0x50 [kvm_intel] > vmx_free_vcpu+0x54/0xc0 [kvm_intel] > kvm_arch_vcpu_destroy+0x28/0xf0 > kvm_vcpu_destroy+0x12/0x50 > kvm_arch_destroy_vm+0x12c/0x1c0 > kvm_put_kvm+0x263/0x3c0 > kvm_vm_release+0x21/0x30 > > and an upcoming fix to process injectable interrupts on nested VM-Exit > will access the PIC: > > BUG: kernel NULL pointer dereference, address: 0000000000000090 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > CPU: 23 UID: 1000 PID: 2658 Comm: kvm-nx-lpage-re > RIP: 0010:kvm_cpu_has_extint+0x2f/0x60 [kvm] > Call Trace: > <TASK> > kvm_cpu_has_injectable_intr+0xe/0x60 [kvm] > nested_vmx_vmexit+0x2d7/0xdf0 [kvm_intel] > nested_vmx_free_vcpu+0x40/0x50 [kvm_intel] > vmx_vcpu_free+0x2d/0x80 [kvm_intel] > kvm_arch_vcpu_destroy+0x2d/0x130 [kvm] > kvm_destroy_vcpus+0x8a/0x100 [kvm] > kvm_arch_destroy_vm+0xa7/0x1d0 [kvm] > kvm_destroy_vm+0x172/0x300 [kvm] > kvm_vcpu_release+0x31/0x50 [kvm] > > Inarguably, both nSVM and nVMX need to be fixed, but punt on those > cleanups for the moment. Conceptually, vCPUs should be freed before VM > state. Assets like the I/O APIC and PIC _must_ be allocated before vCPUs > are created, so it stands to reason that they must be freed _after_ vCPUs > are destroyed. > > Reported-by: Aaron Lewis <aaronlewis@google.com> > Closes: https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com > Cc: Jim Mattson <jmattson@google.com> > Cc: Yan Zhao <yan.y.zhao@intel.com> > Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com> > Cc: Kai Huang <kai.huang@intel.com> > Cc: Isaku Yamahata <isaku.yamahata@intel.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/x86.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 58b82d6fd77c..045c61cc7e54 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -12890,11 +12890,11 @@ void kvm_arch_destroy_vm(struct kvm *kvm) > mutex_unlock(&kvm->slots_lock); > } > kvm_unload_vcpu_mmus(kvm); > + kvm_destroy_vcpus(kvm); > kvm_x86_call(vm_destroy)(kvm); > kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); > kvm_pic_destroy(kvm); > kvm_ioapic_destroy(kvm); > - kvm_destroy_vcpus(kvm); > kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); > kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); > kvm_mmu_uninit_vm(kvm);
On Wed, Feb 26, 2025, Paolo Bonzini wrote: > On 2/25/25 00:55, Sean Christopherson wrote: > > Free vCPUs before freeing any VM state, as both SVM and VMX may access > > VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs > > to be kicked out of nested guest mode. > > > > Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was > > called") partially fixed the issue, but for unknown reasons only moved the > > MMU unloading before VM destruction. Complete the change, and free all > > vCPU state prior to destroying VM state, as nVMX accesses even more state > > than nSVM. > > I applied this to kvm-coco-queue, I will place it in kvm/master too unless > you shout. Depends on what "this" is :-) My plan/hope is to land patches 1 and 2 in 6.14, i.e. in kvm/master, but the rest are firmly 6.15 IMO. And based on Yan's feedback, I'm planning on adding a few more cleanups (though I think they're fully additive, i.e. can go on top).
On Tue, Feb 25, 2025 at 07:04:55AM -0800, Sean Christopherson wrote: > On Tue, Feb 25, 2025, Yan Zhao wrote: > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index 58b82d6fd77c..045c61cc7e54 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -12890,11 +12890,11 @@ void kvm_arch_destroy_vm(struct kvm *kvm) > > > mutex_unlock(&kvm->slots_lock); > > > } > > > kvm_unload_vcpu_mmus(kvm); > > > + kvm_destroy_vcpus(kvm); > > > kvm_x86_call(vm_destroy)(kvm); > > > kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); > > > kvm_pic_destroy(kvm); > > > kvm_ioapic_destroy(kvm); > > > - kvm_destroy_vcpus(kvm); > > > kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); > > > kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); > > > kvm_mmu_uninit_vm(kvm); > > After this change, now the sequence is that > > > > 1. kvm_arch_pre_destroy_vm() > > 2. kvm_arch_destroy_vm() > > 2.1 kvm_destroy_vcpus() > > 2.2 .vm_destroy hook > > 2.3 kvm_mmu_uninit_vm() --> mirror root ref is 1 upon here. Zap the mirror > > root and reclaim SETP page table pages. > > 2.4 .vm_free hook > > > > Since TDX needs to reclaim the TDR page after reclaiming all other pages, we > > currently added a vm_free hook at 2.4, after 2.3. > > > > Could we move kvm_mmu_uninit_vm() before the .vm_destroy hook and after > > kvm_destroy_vcpus()? > > > > Or move the .vm_destroy hook after kvm_mmu_uninit_vm(), e.g. after > > kvm_page_track_cleanup()? > > I would go for the first option. I'll tack on a patch since I need to test all > of these flows anyways, and I would much prefer to change course sooner rather > than later if it doesn't work for whatever reason. > > Is this comment accurate? Yes. And since tdx_mmu_release_hkid() is called before kvm_unload_vcpu_mmus(), patch 4 in this series is required by TDX. Otherwise, the list_add in tdx_vcpu_load will cause corruption during tearing down, since tdx_mmu_release_hkid() has already invoked tdx_disassociate_vp() on all vcpus. kvm_unload_vcpu_mmu vcpu_load tdx_vcpu_load list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu)) So, maybe a change as below is also required by TDX in case vcpu_load() is accidentally invoked after .vm_pre_destroy. diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index b67df0af64f3..183192706ced 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -703,7 +704,7 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) struct vcpu_tdx *tdx = to_tdx(vcpu); vmx_vcpu_pi_load(vcpu, cpu); - if (vcpu->cpu == cpu) + if (vcpu->cpu == cpu || !is_hkid_assigned(to_kvm_tdx(vcpu->kvm))) return; tdx_flush_vp_on_cpu(vcpu); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 1e5f6f820c0b..f5685f153e08 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -12874,13 +12874,19 @@ void kvm_arch_destroy_vm(struct kvm *kvm) > mutex_unlock(&kvm->slots_lock); > } > kvm_destroy_vcpus(kvm); > + > + /* > + * Do final MMU teardown prior to calling into vendor code. All pages > + * that were donated to the TDX module, e.g. for S-EPT tables, need to > + * be reclaimed before the VM metadata page can be freed. > + */ > + kvm_mmu_uninit_vm(kvm); > kvm_x86_call(vm_destroy)(kvm); > kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); > kvm_pic_destroy(kvm); > kvm_ioapic_destroy(kvm); > kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); > kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); > - kvm_mmu_uninit_vm(kvm); > kvm_page_track_cleanup(kvm); > kvm_xen_destroy_vm(kvm); > kvm_hv_destroy_vm(kvm);
On Wed, Feb 26, 2025 at 1:27 AM Sean Christopherson <seanjc@google.com> wrote: > > On Wed, Feb 26, 2025, Paolo Bonzini wrote: > > On 2/25/25 00:55, Sean Christopherson wrote: > > > Free vCPUs before freeing any VM state, as both SVM and VMX may access > > > VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs > > > to be kicked out of nested guest mode. > > > > > > Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was > > > called") partially fixed the issue, but for unknown reasons only moved the > > > MMU unloading before VM destruction. Complete the change, and free all > > > vCPU state prior to destroying VM state, as nVMX accesses even more state > > > than nSVM. > > > > I applied this to kvm-coco-queue, I will place it in kvm/master too unless > > you shout. > > Depends on what "this" is :-) > > My plan/hope is to land patches 1 and 2 in 6.14, i.e. in kvm/master I meant only 1, but if you want to have 2 as well then that's fine too. As to kvm-coco-queue, based on Yan's reply I have 1 (of course), 4 and an extra patch to move kvm_x86_call(vm_destroy) at the very end of kvm_arch_destroy_vm; I'll post everything as soon as I finish building and testing. Paolo > rest are firmly 6.15 IMO. And based on Yan's feedback, I'm planning on adding a > few more cleanups (though I think they're fully additive, i.e. can go on top). >
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 58b82d6fd77c..045c61cc7e54 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12890,11 +12890,11 @@ void kvm_arch_destroy_vm(struct kvm *kvm) mutex_unlock(&kvm->slots_lock); } kvm_unload_vcpu_mmus(kvm); + kvm_destroy_vcpus(kvm); kvm_x86_call(vm_destroy)(kvm); kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1)); kvm_pic_destroy(kvm); kvm_ioapic_destroy(kvm); - kvm_destroy_vcpus(kvm); kvfree(rcu_dereference_check(kvm->arch.apic_map, 1)); kfree(srcu_dereference_check(kvm->arch.pmu_event_filter, &kvm->srcu, 1)); kvm_mmu_uninit_vm(kvm);
Free vCPUs before freeing any VM state, as both SVM and VMX may access VM state when "freeing" a vCPU that is currently "in" L2, i.e. that needs to be kicked out of nested guest mode. Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was called") partially fixed the issue, but for unknown reasons only moved the MMU unloading before VM destruction. Complete the change, and free all vCPU state prior to destroying VM state, as nVMX accesses even more state than nSVM. In addition to the AVIC, KVM can hit a use-after-free on MSR filters: kvm_msr_allowed+0x4c/0xd0 __kvm_set_msr+0x12d/0x1e0 kvm_set_msr+0x19/0x40 load_vmcs12_host_state+0x2d8/0x6e0 [kvm_intel] nested_vmx_vmexit+0x715/0xbd0 [kvm_intel] nested_vmx_free_vcpu+0x33/0x50 [kvm_intel] vmx_free_vcpu+0x54/0xc0 [kvm_intel] kvm_arch_vcpu_destroy+0x28/0xf0 kvm_vcpu_destroy+0x12/0x50 kvm_arch_destroy_vm+0x12c/0x1c0 kvm_put_kvm+0x263/0x3c0 kvm_vm_release+0x21/0x30 and an upcoming fix to process injectable interrupts on nested VM-Exit will access the PIC: BUG: kernel NULL pointer dereference, address: 0000000000000090 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 23 UID: 1000 PID: 2658 Comm: kvm-nx-lpage-re RIP: 0010:kvm_cpu_has_extint+0x2f/0x60 [kvm] Call Trace: <TASK> kvm_cpu_has_injectable_intr+0xe/0x60 [kvm] nested_vmx_vmexit+0x2d7/0xdf0 [kvm_intel] nested_vmx_free_vcpu+0x40/0x50 [kvm_intel] vmx_vcpu_free+0x2d/0x80 [kvm_intel] kvm_arch_vcpu_destroy+0x2d/0x130 [kvm] kvm_destroy_vcpus+0x8a/0x100 [kvm] kvm_arch_destroy_vm+0xa7/0x1d0 [kvm] kvm_destroy_vm+0x172/0x300 [kvm] kvm_vcpu_release+0x31/0x50 [kvm] Inarguably, both nSVM and nVMX need to be fixed, but punt on those cleanups for the moment. Conceptually, vCPUs should be freed before VM state. Assets like the I/O APIC and PIC _must_ be allocated before vCPUs are created, so it stands to reason that they must be freed _after_ vCPUs are destroyed. Reported-by: Aaron Lewis <aaronlewis@google.com> Closes: https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@google.com Cc: Jim Mattson <jmattson@google.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com> Cc: Kai Huang <kai.huang@intel.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/x86.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)