Message ID | 20250211025442.3071607-9-binbin.wu@linux.intel.com (mailing list archive) |
---|---|
State | New |
Series | KVM: TDX: TDX hypercalls may exit to userspace |
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index f13da28dd4a2..8f3147c6e602 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -849,8 +849,12 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
> 		if (tdvmcall_exit_type(vcpu))
> 			return EXIT_REASON_VMCALL;
>
>-		if (tdvmcall_leaf(vcpu) < 0x10000)
>+		if (tdvmcall_leaf(vcpu) < 0x10000) {
>+			if (tdvmcall_leaf(vcpu) == EXIT_REASON_EPT_VIOLATION)
>+				return EXIT_REASON_EPT_MISCONFIG;

IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to distinguish
between a genuine EPT MISCONFIG and a morphed one, and handle them differently?
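For readers following the question above, the "morphing" in the quoted hunk can be sketched in isolation. This is a minimal user-space illustration, not the kernel function: `sketch_tdvmcall_to_exit_reason()` and `SKETCH_NO_VMX_REASON` are hypothetical names, and the fall-through value for leaves at or above 0x10000 is an assumption, since the hunk only shows a `break`. The three exit-reason numbers are the architectural VMX basic exit reasons.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VMX basic exit reasons (architectural values). */
#define EXIT_REASON_VMCALL        18u
#define EXIT_REASON_EPT_VIOLATION 48u
#define EXIT_REASON_EPT_MISCONFIG 49u

/* Hypothetical marker for "no VMX equivalent"; not a kernel constant. */
#define SKETCH_NO_VMX_REASON 0xffffffffu

/*
 * Map a TDVMCALL to a VMX-style exit reason, mirroring the quoted hunk:
 * a non-zero exit type is treated as a plain VMCALL, and a guest-requested
 * MMIO access (leaf == EPT_VIOLATION) is reported as EPT_MISCONFIG.
 */
uint32_t sketch_tdvmcall_to_exit_reason(bool nonzero_exit_type, uint64_t leaf)
{
	if (nonzero_exit_type)
		return EXIT_REASON_VMCALL;

	if (leaf < 0x10000) {
		if (leaf == EXIT_REASON_EPT_VIOLATION)
			return EXIT_REASON_EPT_MISCONFIG;
		return (uint32_t)leaf;
	}

	/* Assumption: the real code falls through to other handling here. */
	return SKETCH_NO_VMX_REASON;
}
```

Because the morphed value is the same number a genuine EPT misconfiguration TD-exit would carry, a caller that needs to tell the two apart has to look at where the exit came from, not at the returned reason alone.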
On 2/12/2025 10:28 AM, Chao Gao wrote:
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index f13da28dd4a2..8f3147c6e602 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -849,8 +849,12 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
>>  		if (tdvmcall_exit_type(vcpu))
>>  			return EXIT_REASON_VMCALL;
>>
>> -		if (tdvmcall_leaf(vcpu) < 0x10000)
>> +		if (tdvmcall_leaf(vcpu) < 0x10000) {
>> +			if (tdvmcall_leaf(vcpu) == EXIT_REASON_EPT_VIOLATION)
>> +				return EXIT_REASON_EPT_MISCONFIG;
> IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to distinguish
> between a genuine EPT MISCONFIG and a morphed one, and handle them differently?

It will be handled separately, which will be in the last section of the KVM
basic support. But the v2 of "the rest" section is on hold because there is
a discussion related to MTRR MSR handling:
https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
Want to send the v2 of "the rest" section after the MTRR discussion is
finalized.

For the genuine EPT misconfig handling, you can refer to the patch on the
full KVM branch:
https://github.com/intel/tdx/commit/e576682ac586f994bf54eb11b357f3e835d3c042
On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
> > IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
> > distinguish between a genuine EPT MISCONFIG and a morphed one, and
> > handle them differently?
> It will be handled separately, which will be in the last section of the KVM
> basic support. But the v2 of "the rest" section is on hold because there is
> a discussion related to MTRR MSR handling:
> https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
> Want to send the v2 of "the rest" section after the MTRR discussion is
> finalized.

I think we can just put back the original MTRR code (post KVM MTRR removal
version) for the next posting of the rest. The reason being Sean was pointing
out that it is more architecturally correct given that the CPUID bit is
exposed. So we will need that regardless of the guest solution.

But it would probably be good to update this before re-posting:
https://lore.kernel.org/kvm/20241210004946.3718496-19-binbin.wu@linux.intel.com/#t

Given the last one got hardly any comments and the mostly recent patches are
already in kvm-coco-queue, I say we try to review that version a bit more. This
is different than previously discussed. Any objections?
On 2/14/2025 5:41 AM, Edgecombe, Rick P wrote:
> On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
>>> IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
>>> distinguish between a genuine EPT MISCONFIG and a morphed one, and
>>> handle them differently?
>> It will be handled separately, which will be in the last section of the KVM
>> basic support. But the v2 of "the rest" section is on hold because there is
>> a discussion related to MTRR MSR handling:
>> https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
>> Want to send the v2 of "the rest" section after the MTRR discussion is
>> finalized.
> I think we can just put back the original MTRR code (post KVM MTRR removal
> version) for the next posting of the rest. The reason being Sean was pointing
> out that it is more architecturally correct given that the CPUID bit is
> exposed. So we will need that regardless of the guest solution.

The original MTRR code before removing is:
https://lore.kernel.org/kvm/81119d66392bc9446340a16f8a532c7e1b2665a2.1708933498.git.isaku.yamahata@intel.com/

It enforces WB as default memtype and disables fixed/variable range MTRRs.
That means this solution doesn't allow the guest to use MTRRs as a
communication channel if the guest firmware wants to program some ranges to
UC for legacy devices.

How about allowing TDX guests to access MTRR MSRs as KVM does for
normal VMs?

Guest kernels may use MTRRs as a crutch to get the desired memtype for devices.
E.g., in most KVM-based setups, legacy devices such as the HPET and TPM are
enumerated via ACPI. And in the Linux kernel, for unknown reasons, ACPI
auto-maps such devices as WB, whereas the dedicated device drivers map memory
as WC or UC. The ACPI mappings rely on firmware to configure the PCI hole (and
other device memory) to be UC in the MTRRs to end up UC-, which is compatible
with the drivers' requested WC/UC-.

So KVM needs to allow guests to program the desired value in MTRRs in case
guests want to use MTRRs as a communication channel between guest firmware
and the kernel.

Allow TDX guests to access MTRR MSRs as KVM does for normal VMs, i.e.,
KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR memory
types. One open question is whether to enforce WB as the default MTRR memtype.

However, TDX disallows toggling CR0.CD. If a TDX guest wants to use MTRRs
as the communication channel, it should skip toggling CR0.CD when it
programs MTRRs both in guest firmware and guest kernel. For a guest, there
is no reason to disable caches because it's in a virtual environment. It
makes sense for guest firmware/kernel to skip toggling CR0.CD when it
detects it's running as a TDX guest.

>
> But it would probably be good to update this before re-posting:
> https://lore.kernel.org/kvm/20241210004946.3718496-19-binbin.wu@linux.intel.com/#t
>
> Given the last one got hardly any comments and the mostly recent patches are
> already in kvm-coco-queue, I say we try to review that version a bit more. This
> is different than previously discussed. Any objections?
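As a rough illustration of "emulate accesses to MTRR MSRs, but don't virtualize the memory types", here is a minimal user-space sketch. It is not KVM code: `struct sketch_mtrr_state` and the `sketch_mtrr_*()` helpers are hypothetical, the count of eight variable ranges is an assumption (the real count is reported in IA32_MTRRCAP), and fixed-range MTRRs and reserved-bit checks are left out. The two MSR indices are the architectural ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural MSR indices. */
#define MSR_IA32_MTRR_PHYSBASE0 0x200u
#define MSR_IA32_MTRR_DEF_TYPE  0x2ffu

/* Assumption for this sketch: 8 variable ranges. */
#define SKETCH_NR_VAR_RANGES 8u

struct sketch_mtrr_state {
	uint64_t var[2 * SKETCH_NR_VAR_RANGES]; /* PHYSBASEn/PHYSMASKn pairs */
	uint64_t def_type;
};

/* Return the backing slot for an MTRR MSR, or NULL if it is not handled. */
uint64_t *sketch_mtrr_slot(struct sketch_mtrr_state *s, uint32_t msr)
{
	assert(s != NULL);

	if (msr == MSR_IA32_MTRR_DEF_TYPE)
		return &s->def_type;
	if (msr >= MSR_IA32_MTRR_PHYSBASE0 &&
	    msr < MSR_IA32_MTRR_PHYSBASE0 + 2 * SKETCH_NR_VAR_RANGES)
		return &s->var[msr - MSR_IA32_MTRR_PHYSBASE0];
	return NULL;
}

/* Record the guest's value. Nothing else ever consults it for memtypes. */
int sketch_mtrr_write(struct sketch_mtrr_state *s, uint32_t msr, uint64_t val)
{
	uint64_t *slot = sketch_mtrr_slot(s, msr);

	if (!slot)
		return -1;
	*slot = val;
	return 0;
}

/* Hand back whatever the guest last wrote (zero if never written). */
int sketch_mtrr_read(struct sketch_mtrr_state *s, uint32_t msr, uint64_t *val)
{
	uint64_t *slot = sketch_mtrr_slot(s, msr);

	if (!slot)
		return -1;
	*val = *slot;
	return 0;
}
```

In this model the MSRs behave like plain storage: guest firmware can write a UC range and the guest kernel can read it back later, which is all the "communication channel" use case needs.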
On Fri, 2025-02-14 at 08:47 +0800, Binbin Wu wrote:
>
> On 2/14/2025 5:41 AM, Edgecombe, Rick P wrote:
> > On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
> > > > IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
> > > > distinguish between a genuine EPT MISCONFIG and a morphed one, and
> > > > handle them differently?
> > > It will be handled separately, which will be in the last section of the KVM
> > > basic support. But the v2 of "the rest" section is on hold because there is
> > > a discussion related to MTRR MSR handling:
> > > https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
> > > Want to send the v2 of "the rest" section after the MTRR discussion is
> > > finalized.
> > I think we can just put back the original MTRR code (post KVM MTRR removal
> > version) for the next posting of the rest. The reason being Sean was pointing
> > out that it is more architecturally correct given that the CPUID bit is
> > exposed. So we will need that regardless of the guest solution.
> The original MTRR code before removing is:
> https://lore.kernel.org/kvm/81119d66392bc9446340a16f8a532c7e1b2665a2.1708933498.git.isaku.yamahata@intel.com/
>
> It enforces WB as default memtype and disables fixed/variable range MTRRs.
> That means this solution doesn't allow the guest to use MTRRs as a
> communication channel if the guest firmware wants to program some ranges to
> UC for legacy devices.

I'm talking about the internal version that existed after KVM removed MTRRs for
normal VMs. We are not talking about adding back KVM MTRRs.

>
> How about allowing TDX guests to access MTRR MSRs as KVM does for
> normal VMs?
>
> Guest kernels may use MTRRs as a crutch to get the desired memtype for devices.
> E.g., in most KVM-based setups, legacy devices such as the HPET and TPM are
> enumerated via ACPI. And in the Linux kernel, for unknown reasons, ACPI
> auto-maps such devices as WB, whereas the dedicated device drivers map memory
> as WC or UC. The ACPI mappings rely on firmware to configure the PCI hole (and
> other device memory) to be UC in the MTRRs to end up UC-, which is compatible
> with the drivers' requested WC/UC-.
>
> So KVM needs to allow guests to program the desired value in MTRRs in case
> guests want to use MTRRs as a communication channel between guest firmware
> and the kernel.
>
> Allow TDX guests to access MTRR MSRs as KVM does for normal VMs, i.e.,
> KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR memory
> types. One open question is whether to enforce WB as the default MTRR memtype.

This is basically what we had previously (internally), right?

>
> However, TDX disallows toggling CR0.CD. If a TDX guest wants to use MTRRs
> as the communication channel, it should skip toggling CR0.CD when it
> programs MTRRs both in guest firmware and guest kernel. For a guest, there
> is no reason to disable caches because it's in a virtual environment. It
> makes sense for guest firmware/kernel to skip toggling CR0.CD when it
> detects it's running as a TDX guest.

I don't see why we have to tie exposing MTRR to a particular solution for the
guest and bios. Let's focus on the work we know we need regardless for KVM.
On 2/14/2025 9:01 AM, Edgecombe, Rick P wrote:
> On Fri, 2025-02-14 at 08:47 +0800, Binbin Wu wrote:
>> On 2/14/2025 5:41 AM, Edgecombe, Rick P wrote:
>>> On Wed, 2025-02-12 at 10:39 +0800, Binbin Wu wrote:
>>>>> IIRC, a TD-exit may occur due to an EPT MISCONFIG. Do you need to
>>>>> distinguish between a genuine EPT MISCONFIG and a morphed one, and
>>>>> handle them differently?
>>>> It will be handled separately, which will be in the last section of the KVM
>>>> basic support. But the v2 of "the rest" section is on hold because there is
>>>> a discussion related to MTRR MSR handling:
>>>> https://lore.kernel.org/all/20250201005048.657470-1-seanjc@google.com/
>>>> Want to send the v2 of "the rest" section after the MTRR discussion is
>>>> finalized.
>>> I think we can just put back the original MTRR code (post KVM MTRR removal
>>> version) for the next posting of the rest. The reason being Sean was pointing
>>> out that it is more architecturally correct given that the CPUID bit is
>>> exposed. So we will need that regardless of the guest solution.
>> The original MTRR code before removing is:
>> https://lore.kernel.org/kvm/81119d66392bc9446340a16f8a532c7e1b2665a2.1708933498.git.isaku.yamahata@intel.com/
>>
>> It enforces WB as default memtype and disables fixed/variable range MTRRs.
>> That means this solution doesn't allow the guest to use MTRRs as a
>> communication channel if the guest firmware wants to program some ranges to
>> UC for legacy devices.
> I'm talking about the internal version that existed after KVM removed MTRRs for
> normal VMs. We are not talking about adding back KVM MTRRs.

Sorry, I misunderstood it.

>
>>
>> How about allowing TDX guests to access MTRR MSRs as KVM does for
>> normal VMs?
>>
>> Guest kernels may use MTRRs as a crutch to get the desired memtype for devices.
>> E.g., in most KVM-based setups, legacy devices such as the HPET and TPM are
>> enumerated via ACPI. And in the Linux kernel, for unknown reasons, ACPI
>> auto-maps such devices as WB, whereas the dedicated device drivers map memory
>> as WC or UC. The ACPI mappings rely on firmware to configure the PCI hole (and
>> other device memory) to be UC in the MTRRs to end up UC-, which is compatible
>> with the drivers' requested WC/UC-.
>>
>> So KVM needs to allow guests to program the desired value in MTRRs in case
>> guests want to use MTRRs as a communication channel between guest firmware
>> and the kernel.
>>
>> Allow TDX guests to access MTRR MSRs as KVM does for normal VMs, i.e.,
>> KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR memory
>> types. One open question is whether to enforce WB as the default MTRR memtype.
> This is basically what we had previously (internally), right?

Yes. Then we are aligned. :)

>
>> However, TDX disallows toggling CR0.CD. If a TDX guest wants to use MTRRs
>> as the communication channel, it should skip toggling CR0.CD when it
>> programs MTRRs both in guest firmware and guest kernel. For a guest, there
>> is no reason to disable caches because it's in a virtual environment. It
>> makes sense for guest firmware/kernel to skip toggling CR0.CD when it
>> detects it's running as a TDX guest.
> I don't see why we have to tie exposing MTRR to a particular solution for the
> guest and bios. Let's focus on the work we know we need regardless for KVM.

Guest could choose to use MTRRs or other SW protocol to communicate the
memtype for devices. I just wanted to point out that if the guest chooses to
use MTRRs as the communication channel, it will face the #VE issue caused by
toggling CR0.CD.
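The guest-side point about CR0.CD can be pictured with a small model. This is only a sketch under stated assumptions: `struct sketch_guest_cpu` and both functions are hypothetical, the CR0 write is simulated with a counter rather than performed, and it does not describe what any existing firmware or kernel does today. It shows the proposed behaviour: keep the usual "disable caches, program MTRR, re-enable caches" sequence on ordinary CPUs, and skip the CR0.CD toggling when running as a TDX guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the guest CPU state touched while programming MTRRs. */
struct sketch_guest_cpu {
	bool is_tdx_guest;          /* assumed to be detected earlier */
	bool cr0_cd;                /* modelled CR0.CD bit */
	unsigned int cr0_cd_writes; /* how often CR0.CD was toggled */
	uint64_t mtrr_def_type;     /* modelled IA32_MTRR_DEF_TYPE */
};

/* Simulated CR0.CD write; in a TDX guest the real write would raise #VE. */
static void sketch_write_cr0_cd(struct sketch_guest_cpu *cpu, bool cd)
{
	cpu->cr0_cd = cd;
	cpu->cr0_cd_writes++;
}

/* Program the default memtype, toggling CR0.CD only outside a TDX guest. */
void sketch_program_mtrr_def_type(struct sketch_guest_cpu *cpu, uint64_t val)
{
	bool toggle_cd;

	assert(cpu != NULL);
	toggle_cd = !cpu->is_tdx_guest;

	if (toggle_cd)
		sketch_write_cr0_cd(cpu, true);  /* disable caches */

	cpu->mtrr_def_type = val;

	if (toggle_cd)
		sketch_write_cr0_cd(cpu, false); /* re-enable caches */
}
```

With `is_tdx_guest` set, the MTRR value is still recorded but CR0.CD is never written, so the #VE path mentioned above is not reached.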
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f13da28dd4a2..8f3147c6e602 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -849,8 +849,12 @@ static __always_inline u32 tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
 		if (tdvmcall_exit_type(vcpu))
 			return EXIT_REASON_VMCALL;
 
-		if (tdvmcall_leaf(vcpu) < 0x10000)
+		if (tdvmcall_leaf(vcpu) < 0x10000) {
+			if (tdvmcall_leaf(vcpu) == EXIT_REASON_EPT_VIOLATION)
+				return EXIT_REASON_EPT_MISCONFIG;
+
 			return tdvmcall_leaf(vcpu);
+		}
 		break;
 	default:
 		break;
@@ -1193,6 +1197,107 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
+static int tdx_complete_mmio_read(struct kvm_vcpu *vcpu)
+{
+	unsigned long val = 0;
+	gpa_t gpa;
+	int size;
+
+	gpa = vcpu->mmio_fragments[0].gpa;
+	size = vcpu->mmio_fragments[0].len;
+
+	memcpy(&val, vcpu->run->mmio.data, size);
+	tdvmcall_set_return_val(vcpu, val);
+	trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+				 unsigned long val)
+{
+	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+		trace_kvm_fast_mmio(gpa);
+		return 0;
+	}
+
+	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+	if (kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+	unsigned long val;
+
+	if (kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	tdvmcall_set_return_val(vcpu, val);
+	trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int size, write, r;
+	unsigned long val;
+	gpa_t gpa;
+
+	size = tdx->vp_enter_args.r12;
+	write = tdx->vp_enter_args.r13;
+	gpa = tdx->vp_enter_args.r14;
+	val = write ? tdx->vp_enter_args.r15 : 0;
+
+	if (size != 1 && size != 2 && size != 4 && size != 8)
+		goto error;
+	if (write != 0 && write != 1)
+		goto error;
+
+	/*
+	 * TDG.VP.VMCALL<MMIO> allows only shared GPA, it makes no sense to
+	 * do MMIO emulation for private GPA.
+	 */
+	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa) ||
+	    vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))
+		goto error;
+
+	gpa = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
+
+	if (write)
+		r = tdx_mmio_write(vcpu, gpa, size, val);
+	else
+		r = tdx_mmio_read(vcpu, gpa, size);
+	if (!r)
+		/* Kernel completed device emulation. */
+		return 1;
+
+	/* Request the device emulation to userspace device model. */
+	vcpu->mmio_is_write = write;
+	if (!write)
+		vcpu->arch.complete_userspace_io = tdx_complete_mmio_read;
+
+	vcpu->run->mmio.phys_addr = gpa;
+	vcpu->run->mmio.len = size;
+	vcpu->run->mmio.is_write = write;
+	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+	if (write) {
+		memcpy(vcpu->run->mmio.data, &val, size);
+	} else {
+		vcpu->mmio_fragments[0].gpa = gpa;
+		vcpu->mmio_fragments[0].len = size;
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+	}
+	return 0;
+
+error:
+	tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_INVALID_OPERAND);
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	switch (tdvmcall_leaf(vcpu)) {
@@ -1546,6 +1651,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 		return tdx_emulate_vmcall(vcpu);
 	case EXIT_REASON_IO_INSTRUCTION:
 		return tdx_emulate_io(vcpu);
+	case EXIT_REASON_EPT_MISCONFIG:
+		return tdx_emulate_mmio(vcpu);
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29f33f7c9da9..a41d57ba4a86 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14010,6 +14010,7 @@ EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 836e0c69f53b..783683d04939 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5835,6 +5835,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
 	r = __kvm_io_bus_read(vcpu, bus, &range, val);
 	return r < 0 ? r : 0;
 }
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);
 
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 			    int len, struct kvm_io_device *dev)
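The operand checks at the top of tdx_emulate_mmio() in the diff above can be exercised outside the kernel with a small model. This is a sketch, not the patch's code: `struct sketch_mmio_req`, `sketch_is_private_gpa()` and `sketch_validate_mmio()` are hypothetical names, the shared-bit mask is passed in by the caller instead of coming from kvm_gfn_direct_bits(), and "private" is modelled simply as "shared bit clear".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the four TDVMCALL<MMIO> operands (r12..r15). */
struct sketch_mmio_req {
	uint64_t size;  /* access width in bytes */
	uint64_t write; /* 0 = read, 1 = write */
	uint64_t gpa;   /* guest physical address, shared bit included */
	uint64_t val;   /* value to write; ignored for reads */
};

enum sketch_mmio_status {
	SKETCH_MMIO_OK,
	SKETCH_MMIO_INVALID_OPERAND,
};

/* Model: a GPA is private when the shared bit is clear. */
bool sketch_is_private_gpa(uint64_t gpa, uint64_t shared_mask)
{
	return !(gpa & shared_mask);
}

/*
 * Apply the same checks as the patch: width must be 1, 2, 4 or 8, the
 * direction must be 0 or 1, and both the first and last byte must be shared.
 * On success, *dev_gpa receives the GPA with the shared bit stripped, which
 * is the address the device model sees.
 */
enum sketch_mmio_status sketch_validate_mmio(const struct sketch_mmio_req *req,
					     uint64_t shared_mask,
					     uint64_t *dev_gpa)
{
	assert(req != NULL && dev_gpa != NULL);

	if (req->size != 1 && req->size != 2 && req->size != 4 && req->size != 8)
		return SKETCH_MMIO_INVALID_OPERAND;
	if (req->write != 0 && req->write != 1)
		return SKETCH_MMIO_INVALID_OPERAND;
	if (sketch_is_private_gpa(req->gpa, shared_mask) ||
	    sketch_is_private_gpa(req->gpa + req->size - 1, shared_mask))
		return SKETCH_MMIO_INVALID_OPERAND;

	*dev_gpa = req->gpa & ~shared_mask;
	return SKETCH_MMIO_OK;
}
```

For example, assuming the shared bit is GPA bit 47, a 4-byte read of `(1ULL << 47) | 0xfed00000` passes validation and is presented to the device model at 0xfed00000, while the same address without the shared bit is rejected with the invalid-operand status.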