Message ID | 20240923141810.76331-1-iorlov@amazon.com (mailing list archive) |
---|---|
Headers | show |
Series | Process some MMIO-related errors without KVM exit | expand |
On Mon, Sep 23, 2024, Ivan Orlov wrote: > Currently, KVM may return a variety of internal errors to VMM when > accessing MMIO, and some of them could be gracefully handled on the KVM > level instead. Moreover, some of the MMIO-related errors are handled > differently in VMX in comparison with SVM, which produces certain > inconsistency and should be fixed. This patch series introduces > KVM-level handling for the following situations: > > 1) Guest is accessing MMIO during event delivery: triple fault instead > of internal error on VMX and infinite loop on SVM > > 2) Guest fetches an instruction from MMIO: inject #UD and resume guest > execution without internal error No. This is not architectural behavior. It's not even remotely close to architectural behavior. KVM's behavior isn't great, but making up _guest visible_ behavior is not going to happen.
On Mon, 2024-09-23 at 10:04 -0700, Sean Christopherson wrote: > > On Mon, Sep 23, 2024, Ivan Orlov wrote: > > Currently, KVM may return a variety of internal errors to VMM when > > accessing MMIO, and some of them could be gracefully handled on the > > KVM > > level instead. Moreover, some of the MMIO-related errors are > > handled > > differently in VMX in comparison with SVM, which produces certain > > inconsistency and should be fixed. This patch series introduces > > KVM-level handling for the following situations: > > > > 1) Guest is accessing MMIO during event delivery: triple fault > > instead > > of internal error on VMX and infinite loop on SVM > > > > 2) Guest fetches an instruction from MMIO: inject #UD and resume > > guest > > execution without internal error > > No. This is not architectural behavior. It's not even remotely > close to > architectural behavior. KVM's behavior isn't great, but making up > _guest visible_ > behavior is not going to happen. Is this a no to the whole series or from the cover letter? For patch 1 we have observed that if a guest has incorrectly set it's IDT base to point inside of an MMIO region it will result in a triple fault (bare metal Cascake Lake Intel). Yes a sane operating system is not really going to be doing setting it's IDT or GDT base to point into an MMIO region, but we've seen occurrences. Normally when other external things have gone horribly wrong. Ivan can clarify as to what's been seen on AMD platforms regarding the infinite loop for patch one. This was also tested on bare metal hardware. Injection of the #UD within patch 2 may be debatable but I believe Ivan has some more data from experiments backing this up. Best regards, Jack
On Mon, Sep 23, 2024, Jack Allister wrote: > On Mon, 2024-09-23 at 10:04 -0700, Sean Christopherson wrote: > > > > On Mon, Sep 23, 2024, Ivan Orlov wrote: > > > Currently, KVM may return a variety of internal errors to VMM when > > > accessing MMIO, and some of them could be gracefully handled on the > > > KVM > > > level instead. Moreover, some of the MMIO-related errors are > > > handled > > > differently in VMX in comparison with SVM, which produces certain > > > inconsistency and should be fixed. This patch series introduces > > > KVM-level handling for the following situations: > > > > > > 1) Guest is accessing MMIO during event delivery: triple fault > > > instead > > > of internal error on VMX and infinite loop on SVM > > > > > > 2) Guest fetches an instruction from MMIO: inject #UD and resume > > > guest > > > execution without internal error > > > > No. This is not architectural behavior. It's not even remotely > > close to > > architectural behavior. KVM's behavior isn't great, but making up > > _guest visible_ > > behavior is not going to happen. > > Is this a no to the whole series or from the cover letter? The whole series. > For patch 1 we have observed that if a guest has incorrectly set it's > IDT base to point inside of an MMIO region it will result in a triple > fault (bare metal Cascake Lake Intel). That happens because the IDT is garbage and/or the CPU is getting master abort semantics back, not because anything in the x86 architectures says that accessing MMIO during exception vectoring goes straight to shutdown. > Yes a sane operating system is not really going to be doing setting it's IDT > or GDT base to point into an MMIO region, but we've seen occurrences. > Normally when other external things have gone horribly wrong. > > Ivan can clarify as to what's been seen on AMD platforms regarding the > infinite loop for patch one. This was also tested on bare metal > hardware. Injection of the #UD within patch 2 may be debatable but I > believe Ivan has some more data from experiments backing this up. I have no problems improving KVM's handling of scenarios that KVM can't emulate, but there needs to be reasonable justification for taking on complexity, and KVM must not make up guest visible behavior.
On Mon, Sep 23, 2024, Jack Allister wrote: > On Mon, 2024-09-23 at 10:04 -0700, Sean Christopherson wrote: > > > > On Mon, Sep 23, 2024, Ivan Orlov wrote: > > > Currently, KVM may return a variety of internal errors to VMM when > > > accessing MMIO, and some of them could be gracefully handled on the > > > KVM > > > level instead. Moreover, some of the MMIO-related errors are > > > handled > > > differently in VMX in comparison with SVM, which produces certain > > > inconsistency and should be fixed. This patch series introduces > > > KVM-level handling for the following situations: > > > > > > 1) Guest is accessing MMIO during event delivery: triple fault > > > instead > > > of internal error on VMX and infinite loop on SVM > > > > > > 2) Guest fetches an instruction from MMIO: inject #UD and resume > > > guest > > > execution without internal error > > > > No. This is not architectural behavior. It's not even remotely close to > > architectural behavior. KVM's behavior isn't great, but making up _guest > > visible_ behavior is not going to happen. > > Is this a no to the whole series or from the cover letter? The whole series. > For patch 1 we have observed that if a guest has incorrectly set it's > IDT base to point inside of an MMIO region it will result in a triple > fault (bare metal Cascake Lake Intel). The triple fault occurs because the MMIO read returns garbage, e.g. because it gets back master abort semantics. > Yes a sane operating system is not really going to be doing setting it's IDT > or GDT base to point into an MMIO region, but we've seen occurrences. Sure, but that doesn't make it architecturally correct to synthesize arbitrary faults. > Normally when other external things have gone horribly wrong. > > Ivan can clarify as to what's been seen on AMD platforms regarding the > infinite loop for patch one. So it sounds like what you really want to do is not put the vCPU into an infinite loop. Have you tried kvm/next or kvm-x86/next, which has fixes for infinite loops on TDP faults? Specifically, these commits: 98a69b96caca3e07aff57ca91fd7cc3a3853871a KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn d859b16161c81ee929b7b02a85227b8e3250bc97 KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list 6b3dcabc10911711eba15816d808e2a18f130406 KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version 2876624e1adcd9a3a3ffa8c4fe3bf8dbba969d95 KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() 4df685664bed04794ad72b58d8af1fa4fcc60261 KVM: x86: Update retry protection fields when forcing retry on emulation failure dabc4ff70c35756bc107bc5d035d0f0746396a9a KVM: x86: Apply retry protection to "unprotect on failure" path 19ab2c8be070160be70a88027b3b93106fef7b89 KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn 620525739521376a65a690df899e1596d56791f8 KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation b299c273c06f005976cdc1b9e9299d492527607e KVM: x86/mmu: Move event re-injection unprotect+retry into common path 29e495bdf847ac6ad0e0d03e5db39a3ed9f12858 KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting b7e948898e772ac900950c0dac4ca90e905cd0c0 KVM: x86/mmu: Don't try to unprotect an INVALID_GPA 2df354e37c1398a85bb43cbbf1f913eb3f91d035 KVM: x86: Fold retry_instruction() into x86_emulate_instruction() 41e6e367d576ce1801dc5c2b106e14cde35e3c80 KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction() dfaae8447c53819749cf3ba10ce24d3c609752e3 KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs 01dd4d319207c4cfd51a1c9a1812909e944d8c86 KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path 9c19129e535bfff85bdfcb5a804e19e5aae935b2 KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry 019f3f84a40c88b68ca4d455306b92c20733e784 KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip c1edcc41c3603c65f34000ae031a20971f4e56f9 KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped 2fb2b7877b3a4cac4de070ef92437b38f13559b0 KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected 989a84c93f592e6b288fb3b96d2eeec827d75bef KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults 4ececec19a0914873634ad69bbaca5557c33e855 KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper > This was also tested on bare metal hardware. Injection of the #UD within > patch 2 may be debatable but I believe Ivan has some more data from > experiments backing this up. Heh, it's not debatable. Fetching from MMIO is perfectly legal. Again, any #UD you see on bare metal is all but guaranteed to be due to fetching garbage.
On Mon, Sep 23, 2024 at 02:46:17PM -0700, Sean Christopherson wrote: > > > > > No. This is not architectural behavior. It's not even remotely > > > close to > > > architectural behavior. KVM's behavior isn't great, but making up > > > _guest visible_ > > > behavior is not going to happen. > > > > Is this a no to the whole series or from the cover letter? > > The whole series. > > > For patch 1 we have observed that if a guest has incorrectly set it's > > IDT base to point inside of an MMIO region it will result in a triple > > fault (bare metal Cascake Lake Intel). > > That happens because the IDT is garbage and/or the CPU is getting master abort > semantics back, not because anything in the x86 architectures says that accessing > MMIO during exception vectoring goes straight to shutdown. > Hi Sean, thank you for the detailed reply. Yes, I ran the reproducer on my AMD Ryzen 5 once again, and it seems like pointing the IDT descriptor base to a framebuffer works perfectly fine, without any triple faults, so injecting it into guest is not a correct solution. However, I believe KVM should demonstrate consistent behaviour in handling MMIO during event delivery on vmx/svm, so either by returning an event delivery error in both cases or going into infinite loop in both cases. I'm going to test the kvm/next with the commits you mentioned, and send a fix if it still hits an infinite loop on SVM. Also, I reckon as detecting such an issue on the KVM level doesn't introduce much complexity, returning a sane error flag would be nice. For instance, we could set one of the 'internal.data' elements to identify that the problem occured due to MMIO during event delivery > > Yes a sane operating system is not really going to be doing setting it's IDT > > or GDT base to point into an MMIO region, but we've seen occurrences. > > Normally when other external things have gone horribly wrong. > > > > Ivan can clarify as to what's been seen on AMD platforms regarding the > > infinite loop for patch one. This was also tested on bare metal > > hardware. Injection of the #UD within patch 2 may be debatable but I > > believe Ivan has some more data from experiments backing this up. > > I have no problems improving KVM's handling of scenarios that KVM can't emulate, > but there needs to be reasonable justification for taking on complexity, and KVM > must not make up guest visible behavior. Regarding the #UD-related change: the way how I formulated it in the cover letter and the patch is confusing, sorry. We are _alredy_ enqueuing an #UD when fetching from MMIO, so I believe it is already architecturally incorrect (see handle_emulation_failure in arch/x86/kvm/x86.c). However, we return an emulation failure after that, and I believe how we do this is debatable. I maintain we should either set a flag in emulation_failure.flags to indicate that the error happened due to fetch from mmio (to give more information to VMM), or we shouldn't return an error at all... Maybe it should be KVM_EXIT_MMIO with some flag set? What do you think? Thank you! Kind regards, Ivan Orlov
On Tue, Sep 24, 2024, Ivan Orlov wrote: > On Mon, Sep 23, 2024 at 02:46:17PM -0700, Sean Christopherson wrote: > > > > > > > No. This is not architectural behavior. It's not even remotely > > > > close to > > > > architectural behavior. KVM's behavior isn't great, but making up > > > > _guest visible_ > > > > behavior is not going to happen. > > > > > > Is this a no to the whole series or from the cover letter? > > > > The whole series. > > > > > For patch 1 we have observed that if a guest has incorrectly set it's > > > IDT base to point inside of an MMIO region it will result in a triple > > > fault (bare metal Cascake Lake Intel). > > > > That happens because the IDT is garbage and/or the CPU is getting master abort > > semantics back, not because anything in the x86 architectures says that accessing > > MMIO during exception vectoring goes straight to shutdown. > > > > Hi Sean, thank you for the detailed reply. > > Yes, I ran the reproducer on my AMD Ryzen 5 once again, and it seems like > pointing the IDT descriptor base to a framebuffer works perfectly fine, > without any triple faults, so injecting it into guest is not a correct > solution. > > However, I believe KVM should demonstrate consistent behaviour in > handling MMIO during event delivery on vmx/svm, so either by returning > an event delivery error in both cases or going into infinite loop in > both cases. I'm going to test the kvm/next with the commits you > mentioned, and send a fix if it still hits an infinite loop on SVM. > > Also, I reckon as detecting such an issue on the KVM level doesn't > introduce much complexity, returning a sane error flag would be nice. For > instance, we could set one of the 'internal.data' elements to identify > that the problem occured due to MMIO during event delivery > > > > Yes a sane operating system is not really going to be doing setting it's IDT > > > or GDT base to point into an MMIO region, but we've seen occurrences. > > > Normally when other external things have gone horribly wrong. > > > > > > Ivan can clarify as to what's been seen on AMD platforms regarding the > > > infinite loop for patch one. This was also tested on bare metal > > > hardware. Injection of the #UD within patch 2 may be debatable but I > > > believe Ivan has some more data from experiments backing this up. > > > > I have no problems improving KVM's handling of scenarios that KVM can't emulate, > > but there needs to be reasonable justification for taking on complexity, and KVM > > must not make up guest visible behavior. > > Regarding the #UD-related change: the way how I formulated it in the > cover letter and the patch is confusing, sorry. We are _alredy_ enqueuing > an #UD when fetching from MMIO, so I believe it is already architecturally > incorrect (see handle_emulation_failure in arch/x86/kvm/x86.c). However, > we return an emulation failure after that, Yeah, but only because the alternative sucks worse. If KVM unconditionally exited with an emulation error, then unsuspecting (read: old) VMMs would likely terminate the guest, which gives guest userspace a way to DoS the entire VM, especially on older CPUs where KVM needs to emulate much more often. if (kvm->arch.exit_on_emulation_error || (emulation_type & EMULTYPE_SKIP)) { prepare_emulation_ctxt_failure_exit(vcpu); return 0; } kvm_queue_exception(vcpu, UD_VECTOR); if (!is_guest_mode(vcpu) && kvm_x86_call(get_cpl)(vcpu) == 0) { prepare_emulation_ctxt_failure_exit(vcpu); return 0; } return 1; And that's exactly why KVM_CAP_EXIT_ON_EMULATION_FAILURE exists. VMMs that know they won't unintentionally give guest userspace what amounts to a privilege escalation can trap the emulation failure, do some logging or whatever, and then take whatever action it wants to take. > and I believe how we do this > is debatable. I maintain we should either set a flag in emulation_failure.flags > to indicate that the error happened due to fetch from mmio (to give more > information to VMM), Generally speaking, I'm not opposed to adding more information along those lines. Though realistically, I don't know that an extra flag is warranted in this case, as it shouldn't be _that_ hard for userspace to deduce what went wrong, especially if KVM_TRANSLATE2[*] lands (though I'm somewhat curious as to why QEMU doesn't do the page walks itself). [*] https://lore.kernel.org/all/20240910152207.38974-1-nikwip@amazon.de > or we shouldn't return an error at all... Maybe it should be KVM_EXIT_MMIO with > some flag set? What do you think? It'd be a breaking change and added complexity, for no benefit as far as I can tell. KVM_EXIT_INTERNAL_ERROR is _not_ a death sentence, or at least it doesn't have to be. Most VMMs do terminate the guest, but nothing is stopping userspace from grabbing RIP and emulating the instruction. I.e. userspace doesn't need "permission" in the form of KVM_EXIT_MMIO to try and keep the guest alive.
On Wed, Sep 25, 2024 at 05:06:45PM -0700, Sean Christopherson wrote: > > Yeah, but only because the alternative sucks worse. If KVM unconditionally exited > with an emulation error, then unsuspecting (read: old) VMMs would likely terminate > the guest, which gives guest userspace a way to DoS the entire VM, especially on > older CPUs where KVM needs to emulate much more often. > > if (kvm->arch.exit_on_emulation_error || > (emulation_type & EMULTYPE_SKIP)) { > prepare_emulation_ctxt_failure_exit(vcpu); > return 0; > } > > kvm_queue_exception(vcpu, UD_VECTOR); > > if (!is_guest_mode(vcpu) && kvm_x86_call(get_cpl)(vcpu) == 0) { > prepare_emulation_ctxt_failure_exit(vcpu); > return 0; > } > > return 1; > > And that's exactly why KVM_CAP_EXIT_ON_EMULATION_FAILURE exists. VMMs that know > they won't unintentionally give guest userspace what amounts to a privilege > escalation can trap the emulation failure, do some logging or whatever, and then > take whatever action it wants to take. > Hi Sean, Makes sense, thank you for the explanation. > > and I believe how we do this > > is debatable. I maintain we should either set a flag in emulation_failure.flags > > to indicate that the error happened due to fetch from mmio (to give more > > information to VMM), > > Generally speaking, I'm not opposed to adding more information along those lines. > Though realistically, I don't know that an extra flag is warranted in this case, > as it shouldn't be _that_ hard for userspace to deduce what went wrong, especially > if KVM_TRANSLATE2[*] lands (though I'm somewhat curious as to why QEMU doesn't do > the page walks itself). > > [*] https://lore.kernel.org/all/20240910152207.38974-1-nikwip@amazon.de > Fair enough, but I still believe that it would be good to provide more information about the failure to the VMM (considering the fact that KVM tries to emulate an instruction anyway, adding a flag won't introduce any performance overhead). I'll think about how we could do that without being redundant :) > > or we shouldn't return an error at all... Maybe it should be KVM_EXIT_MMIO with > > some flag set? What do you think? > > It'd be a breaking change and added complexity, for no benefit as far as I can > tell. KVM_EXIT_INTERNAL_ERROR is _not_ a death sentence, or at least it doesn't > have to be. Most VMMs do terminate the guest, but nothing is stopping userspace > from grabbing RIP and emulating the instruction. I.e. userspace doesn't need > "permission" in the form of KVM_EXIT_MMIO to try and keep the guest alive. Yeah, I just thought that "internal error" is not the best exit code for the situations when guest fetches from MMIO (since it is a perfectly legal operation from the architectural point of view). But I agree that it would be a breaking change without functional benefit ( especially if we provide a flag about what happened :) ). P.S. I tested the latest kvm/next, and if we set the IDT descriptor base to an MMIO address it still falls into the infinite loop on SVM. I'm going to send the fix in the next couple of days. Kind regards, Ivan Orlov