Message ID | 20170816112249.28939-1-pbonzini@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
2017-08-16 13:22+0200, Paolo Bonzini: > Microsoft pointed out privately to me that KVM's handling of > KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid > in EPT misconfiguration vmexit handlers, because neither EPT violations > nor misconfigurations are listed in the manual among the VM exits that > set the VM-exit instruction length field. > > While physical processors seem to set the field, this is not architectural > and is just a side effect of the implementation. I couldn't convince > myself of any condition on the exit qualification where VM-exit > instruction length "has" to be defined; there are no trap-like VM-exits > that can be repurposed; and fault-like VM-exits such as descriptor-table > exits provide no decoding information. So I don't really see any elegant > way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio > 1 will go slower. Do you have some numbers? We could keep the ugliness in KVM and add a new skip function with emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the instruction. (Adding a condition just for EPT violation exit reason to the existing skip function would be a dirtier solution.) Slower than what we have now, but faster than full emulation. I agree that configuring EPT to throw a violation when accessing fast MMIO has many drawbacks (although it seems to be what Intel expected). Thanks. > Adding a hypercall or MSR write that does a fast MMIO write to a physical > address would do it, but it adds hypervisor knowledge in virtio, including > CPUID handling. So it would be pretty ugly in the guest-side implementation, > but if somebody wants to do it and the virtio side is acceptable to the > virtio maintainers, I am okay with it. > > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: Radim Krčmář <rkrcmar@redhat.com> > Cc: stable@vger.kernel.org > Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > --- > arch/x86/kvm/vmx.c | 5 ----- > 1 file changed, 5 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 375dca24cf42..b3eaeb20670d 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) > gpa_t gpa; > > gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); > - if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { > - trace_kvm_fast_mmio(gpa); > - return kvm_skip_emulated_instruction(vcpu); > - } > - > ret = handle_mmio_page_fault(vcpu, gpa, true); > vcpu->arch.gpa_available = true; > if (likely(ret == RET_MMIO_PF_EMULATE)) > -- > 2.13.5 >
On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote: > Microsoft pointed out privately to me that KVM's handling of > KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid > in EPT misconfiguration vmexit handlers, because neither EPT violations > nor misconfigurations are listed in the manual among the VM exits that > set the VM-exit instruction length field. > > While physical processors seem to set the field, this is not architectural > and is just a side effect of the implementation. I couldn't convince > myself of any condition on the exit qualification where VM-exit > instruction length "has" to be defined; there are no trap-like VM-exits > that can be repurposed; and fault-like VM-exits such as descriptor-table > exits provide no decoding information. So I don't really see any elegant > way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio > 1 will go slower. How about I will try asking Intel about it? If they can commit to length being there in the future, we are all set. > Adding a hypercall or MSR write that does a fast MMIO write to a physical > address would do it, but it adds hypervisor knowledge in virtio, including > CPUID handling. Another issue is that it will break DPDK on virtio. > So it would be pretty ugly in the guest-side implementation, > but if somebody wants to do it and the virtio side is acceptable to the > virtio maintainers, I am okay with it. > > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: Radim Krčmář <rkrcmar@redhat.com> > Cc: stable@vger.kernel.org > Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Hmm that's quite unfortunate as we have just completed rolling out MMIO signalling across the board. We did measure a significant slowdown before enabling fast mmio. Guest TX:(TCP) size/session/+throughput%/+cpu%/-+per cpu%/ 64/1/[+18.9183%]/-0.2823%/[+19.2550%]/ 64/2/[+13.5714%]/[+2.2675%]/[+11.0533%]/ 64/4/[+13.1070%]/[+2.1817%]/[+10.6920%]/ 64/8/[+13.0426%]/[+2.0887%]/[+10.7299%]/ 256/1/[+36.2761%]/+6.3434%/[+28.1471%]/ ... 1024/1/[+44.8873%]/+2.0811%/[+41.9335%]/ ... 1024/4/+0.0228%/[-2.2044%]/[+2.2774%]/ ... 16384/2/+0.0127%/[-5.0346%]/[+5.3148%]/ ... 65535/1/[+0.0062%]/[-4.1183%]/[+4.3017%]/ 65535/2/+0.0004%/[-4.2311%]/[+4.4185%]/ 65535/4/+0.0107%/[-4.6106%]/[+4.8446%]/ 65535/8/-0.0090%/[-5.5178%]/[+5.8306%]/ See commit bc85ccfdf5cc045588f665c84b5707d7364c8a6c for more numbers. > --- > arch/x86/kvm/vmx.c | 5 ----- > 1 file changed, 5 deletions(-) > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 375dca24cf42..b3eaeb20670d 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) > gpa_t gpa; > > gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); > - if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { > - trace_kvm_fast_mmio(gpa); > - return kvm_skip_emulated_instruction(vcpu); > - } > - > ret = handle_mmio_page_fault(vcpu, gpa, true); > vcpu->arch.gpa_available = true; > if (likely(ret == RET_MMIO_PF_EMULATE)) > -- > 2.13.5
On 16/08/2017 14:58, Michael S. Tsirkin wrote: > On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote: >> Microsoft pointed out privately to me that KVM's handling of >> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid >> in EPT misconfiguration vmexit handlers, because neither EPT violations >> nor misconfigurations are listed in the manual among the VM exits that >> set the VM-exit instruction length field. >> >> While physical processors seem to set the field, this is not architectural >> and is just a side effect of the implementation. I couldn't convince >> myself of any condition on the exit qualification where VM-exit >> instruction length "has" to be defined; there are no trap-like VM-exits >> that can be repurposed; and fault-like VM-exits such as descriptor-table >> exits provide no decoding information. So I don't really see any elegant >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio >> 1 will go slower. > > How about I will try asking Intel about it? If they can commit to length > being there in the future, we are all set. Nope, "I couldn't convince myself of any condition on the exit qualification where VM-exit instruction length "has" to be defined". So assuming Intel can do it, it would only apply to future processors (2 years+ for server SKUs). Plus of course it wouldn't be guaranteed to work on nested. >> Adding a hypercall or MSR write that does a fast MMIO write to a physical >> address would do it, but it adds hypervisor knowledge in virtio, including >> CPUID handling. > > Another issue is that it will break DPDK on virtio. Not break, just make it slower. Paolo > Hmm that's quite unfortunate as we have just completed rolling out MMIO > signalling across the board. We did measure a significant slowdown > before enabling fast mmio. > > Guest TX:(TCP) > size/session/+throughput%/+cpu%/-+per cpu%/ > 64/1/[+18.9183%]/-0.2823%/[+19.2550%]/ > 64/2/[+13.5714%]/[+2.2675%]/[+11.0533%]/ > 64/4/[+13.1070%]/[+2.1817%]/[+10.6920%]/ > 64/8/[+13.0426%]/[+2.0887%]/[+10.7299%]/ > 256/1/[+36.2761%]/+6.3434%/[+28.1471%]/ > ... > 1024/1/[+44.8873%]/+2.0811%/[+41.9335%]/ > ... > 1024/4/+0.0228%/[-2.2044%]/[+2.2774%]/ > ... > 16384/2/+0.0127%/[-5.0346%]/[+5.3148%]/ > ... > 65535/1/[+0.0062%]/[-4.1183%]/[+4.3017%]/ > 65535/2/+0.0004%/[-4.2311%]/[+4.4185%]/ > 65535/4/+0.0107%/[-4.6106%]/[+4.8446%]/ > 65535/8/-0.0090%/[-5.5178%]/[+5.8306%]/ > > > See commit bc85ccfdf5cc045588f665c84b5707d7364c8a6c for more numbers. > > >> --- >> arch/x86/kvm/vmx.c | 5 ----- >> 1 file changed, 5 deletions(-) >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >> index 375dca24cf42..b3eaeb20670d 100644 >> --- a/arch/x86/kvm/vmx.c >> +++ b/arch/x86/kvm/vmx.c >> @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) >> gpa_t gpa; >> >> gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); >> - if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { >> - trace_kvm_fast_mmio(gpa); >> - return kvm_skip_emulated_instruction(vcpu); >> - } >> - >> ret = handle_mmio_page_fault(vcpu, gpa, true); >> vcpu->arch.gpa_available = true; >> if (likely(ret == RET_MMIO_PF_EMULATE)) >> -- >> 2.13.5
On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote: > On 16/08/2017 14:58, Michael S. Tsirkin wrote: > > On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote: > >> Microsoft pointed out privately to me that KVM's handling of > >> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid > >> in EPT misconfiguration vmexit handlers, because neither EPT violations > >> nor misconfigurations are listed in the manual among the VM exits that > >> set the VM-exit instruction length field. > >> > >> While physical processors seem to set the field, this is not architectural > >> and is just a side effect of the implementation. I couldn't convince > >> myself of any condition on the exit qualification where VM-exit > >> instruction length "has" to be defined; there are no trap-like VM-exits > >> that can be repurposed; and fault-like VM-exits such as descriptor-table > >> exits provide no decoding information. So I don't really see any elegant > >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio > >> 1 will go slower. > > > > How about I will try asking Intel about it? If they can commit to length > > being there in the future, we are all set. > > Nope, "I couldn't convince myself of any condition on the exit > qualification where VM-exit instruction length "has" to be defined". So > assuming Intel can do it, it would only apply to future processors (2 > years+ for server SKUs). Well maybe there's a reason it's actually working. Let's see what can be done. > > Plus of course it wouldn't be guaranteed to work on nested. Not sure I got this one. > >> Adding a hypercall or MSR write that does a fast MMIO write to a physical > >> address would do it, but it adds hypervisor knowledge in virtio, including > >> CPUID handling. > > > > Another issue is that it will break DPDK on virtio. > > Not break, just make it slower. I thought hypercalls can only be triggered from ring 0, userspace can't call them. Dod I get it wrong? > Paolo > > > Hmm that's quite unfortunate as we have just completed rolling out MMIO > > signalling across the board. We did measure a significant slowdown > > before enabling fast mmio. > > > > Guest TX:(TCP) > > size/session/+throughput%/+cpu%/-+per cpu%/ > > 64/1/[+18.9183%]/-0.2823%/[+19.2550%]/ > > 64/2/[+13.5714%]/[+2.2675%]/[+11.0533%]/ > > 64/4/[+13.1070%]/[+2.1817%]/[+10.6920%]/ > > 64/8/[+13.0426%]/[+2.0887%]/[+10.7299%]/ > > 256/1/[+36.2761%]/+6.3434%/[+28.1471%]/ > > ... > > 1024/1/[+44.8873%]/+2.0811%/[+41.9335%]/ > > ... > > 1024/4/+0.0228%/[-2.2044%]/[+2.2774%]/ > > ... > > 16384/2/+0.0127%/[-5.0346%]/[+5.3148%]/ > > ... > > 65535/1/[+0.0062%]/[-4.1183%]/[+4.3017%]/ > > 65535/2/+0.0004%/[-4.2311%]/[+4.4185%]/ > > 65535/4/+0.0107%/[-4.6106%]/[+4.8446%]/ > > 65535/8/-0.0090%/[-5.5178%]/[+5.8306%]/ > > > > > > See commit bc85ccfdf5cc045588f665c84b5707d7364c8a6c for more numbers. > > > > > >> --- > >> arch/x86/kvm/vmx.c | 5 ----- > >> 1 file changed, 5 deletions(-) > >> > >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > >> index 375dca24cf42..b3eaeb20670d 100644 > >> --- a/arch/x86/kvm/vmx.c > >> +++ b/arch/x86/kvm/vmx.c > >> @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) > >> gpa_t gpa; > >> > >> gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); > >> - if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { > >> - trace_kvm_fast_mmio(gpa); > >> - return kvm_skip_emulated_instruction(vcpu); > >> - } > >> - > >> ret = handle_mmio_page_fault(vcpu, gpa, true); > >> vcpu->arch.gpa_available = true; > >> if (likely(ret == RET_MMIO_PF_EMULATE)) > >> -- > >> 2.13.5
On 16/08/2017 15:16, Michael S. Tsirkin wrote: > On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote: >> On 16/08/2017 14:58, Michael S. Tsirkin wrote: >>> On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote: >>>> Microsoft pointed out privately to me that KVM's handling of >>>> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid >>>> in EPT misconfiguration vmexit handlers, because neither EPT violations >>>> nor misconfigurations are listed in the manual among the VM exits that >>>> set the VM-exit instruction length field. >>>> >>>> While physical processors seem to set the field, this is not architectural >>>> and is just a side effect of the implementation. I couldn't convince >>>> myself of any condition on the exit qualification where VM-exit >>>> instruction length "has" to be defined; there are no trap-like VM-exits >>>> that can be repurposed; and fault-like VM-exits such as descriptor-table >>>> exits provide no decoding information. So I don't really see any elegant >>>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio >>>> 1 will go slower. >>> >>> How about I will try asking Intel about it? If they can commit to length >>> being there in the future, we are all set. >> >> Nope, "I couldn't convince myself of any condition on the exit >> qualification where VM-exit instruction length "has" to be defined". So >> assuming Intel can do it, it would only apply to future processors (2 >> years+ for server SKUs). > > Well maybe there's a reason it's actually working. Let's see what can > be done. Sure there is. It just happens that the actual condition for VM-exit instruction length being set correctly is "the fault was taken after the accessing instruction has been decoded". But there's no way according to the spec to detect whether that has happened. While you can filter out instruction fetches, that's not enough. A data read could happen because someone pointed the IDT to MMIO area, and who knows what the VM-exit instruction length points to in that case. >> Plus of course it wouldn't be guaranteed to work on nested. > > Not sure I got this one. Not all nested hypervisors are setting the VM-exit instruction length field on EPT violations, since it's documented not to be set. >>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical >>>> address would do it, but it adds hypervisor knowledge in virtio, including >>>> CPUID handling. >>> >>> Another issue is that it will break DPDK on virtio. >> >> Not break, just make it slower. > > I thought hypercalls can only be triggered from ring 0, userspace can't call them. > Dod I get it wrong? That's just a limitation that KVM makes on currently-defined hypercalls. VMCALL causes a vmexit if executed from ring 3. Paolo
On 16/08/2017 14:07, Radim Krčmář wrote: > 2017-08-16 13:22+0200, Paolo Bonzini: >> Microsoft pointed out privately to me that KVM's handling of >> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid >> in EPT misconfiguration vmexit handlers, because neither EPT violations >> nor misconfigurations are listed in the manual among the VM exits that >> set the VM-exit instruction length field. >> >> While physical processors seem to set the field, this is not architectural >> and is just a side effect of the implementation. I couldn't convince >> myself of any condition on the exit qualification where VM-exit >> instruction length "has" to be defined; there are no trap-like VM-exits >> that can be repurposed; and fault-like VM-exits such as descriptor-table >> exits provide no decoding information. So I don't really see any elegant >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio >> 1 will go slower. > > Do you have some numbers? Raw number from vmexit.flat on Haswell-EP: mmio-no-eventfd:pci-mem 5793 mmio-wildcard-eventfd:pci-mem 1395 mmio-datamatch-eventfd:pci-mem 2268 So roughly 900 clock cycles. Most of the work is the four memory reads done by x86_decode_insn, three to walk the page tables and one to fetch the instruction. > We could keep the ugliness in KVM and add a new skip function with > emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the > instruction. (Adding a condition just for EPT violation exit reason to > the existing skip function would be a dirtier solution.) > Slower than what we have now, but faster than full emulation. This is actually a good idea, and not ugly at all! The main cost is translating the physical address of the instruction and fetching the bytes, so only 200 clock cycles are saved. However, the eventfd is written before decoding, while full emulation would write it after. So while VCPU thread latency is worse compared to skip_emulated_instruction, latency to the iothread remains small. Paolo
On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: > On 16/08/2017 15:16, Michael S. Tsirkin wrote: > > On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote: > >> On 16/08/2017 14:58, Michael S. Tsirkin wrote: > >>> On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote: > >>>> Microsoft pointed out privately to me that KVM's handling of > >>>> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid > >>>> in EPT misconfiguration vmexit handlers, because neither EPT violations > >>>> nor misconfigurations are listed in the manual among the VM exits that > >>>> set the VM-exit instruction length field. > >>>> > >>>> While physical processors seem to set the field, this is not architectural > >>>> and is just a side effect of the implementation. I couldn't convince > >>>> myself of any condition on the exit qualification where VM-exit > >>>> instruction length "has" to be defined; there are no trap-like VM-exits > >>>> that can be repurposed; and fault-like VM-exits such as descriptor-table > >>>> exits provide no decoding information. So I don't really see any elegant > >>>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio > >>>> 1 will go slower. > >>> > >>> How about I will try asking Intel about it? If they can commit to length > >>> being there in the future, we are all set. > >> > >> Nope, "I couldn't convince myself of any condition on the exit > >> qualification where VM-exit instruction length "has" to be defined". So > >> assuming Intel can do it, it would only apply to future processors (2 > >> years+ for server SKUs). > > > > Well maybe there's a reason it's actually working. Let's see what can > > be done. > > Sure there is. It just happens that the actual condition for VM-exit > instruction length being set correctly is "the fault was taken after the > accessing instruction has been decoded". But there's no way according > to the spec to detect whether that has happened. > > While you can filter out instruction fetches, that's not enough. A data > read could happen because someone pointed the IDT to MMIO area, and who > knows what the VM-exit instruction length points to in that case. Interesting. We use it with MMIO so this kind of hack is not supposed to work anyway. Does not seem to be worth slowing virtio for. > >> Plus of course it wouldn't be guaranteed to work on nested. > > > > Not sure I got this one. > > Not all nested hypervisors are setting the VM-exit instruction length > field on EPT violations, since it's documented not to be set. I see. We could special case this by testing the hypervisor leaf if we wanted to. > >>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical > >>>> address would do it, but it adds hypervisor knowledge in virtio, including > >>>> CPUID handling. > >>> > >>> Another issue is that it will break DPDK on virtio. > >> > >> Not break, just make it slower. > > > > I thought hypercalls can only be triggered from ring 0, userspace can't call them. > > Dod I get it wrong? > > That's just a limitation that KVM makes on currently-defined hypercalls. > > VMCALL causes a vmexit if executed from ring 3. > > Paolo OK but we will have to build our own protection mechanism then. With MMIO we just used pagetables.
On Wed, Aug 16, 2017 at 03:37:47PM +0200, Paolo Bonzini wrote: > On 16/08/2017 14:07, Radim Krčmář wrote: > > 2017-08-16 13:22+0200, Paolo Bonzini: > >> Microsoft pointed out privately to me that KVM's handling of > >> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid > >> in EPT misconfiguration vmexit handlers, because neither EPT violations > >> nor misconfigurations are listed in the manual among the VM exits that > >> set the VM-exit instruction length field. > >> > >> While physical processors seem to set the field, this is not architectural > >> and is just a side effect of the implementation. I couldn't convince > >> myself of any condition on the exit qualification where VM-exit > >> instruction length "has" to be defined; there are no trap-like VM-exits > >> that can be repurposed; and fault-like VM-exits such as descriptor-table > >> exits provide no decoding information. So I don't really see any elegant > >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio > >> 1 will go slower. > > > > Do you have some numbers? > > Raw number from vmexit.flat on Haswell-EP: > > mmio-no-eventfd:pci-mem 5793 > mmio-wildcard-eventfd:pci-mem 1395 > mmio-datamatch-eventfd:pci-mem 2268 > > So roughly 900 clock cycles. Most of the work is the four memory reads > done by x86_decode_insn, three to walk the page tables and one to fetch > the instruction. > > > We could keep the ugliness in KVM and add a new skip function with > > emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the > > instruction. (Adding a condition just for EPT violation exit reason to > > the existing skip function would be a dirtier solution.) > > Slower than what we have now, but faster than full emulation. > > This is actually a good idea, and not ugly at all! The main cost is > translating the physical address of the instruction and fetching the > bytes, so only 200 clock cycles are saved. We actually know what to expect (a write) so we could maybe optimize this some more with a dedicated function just for this. > > However, the eventfd is written before decoding, while full emulation > would write it after. So while VCPU thread latency is worse compared to > skip_emulated_instruction, latency to the iothread remains small. > > Paolo
> We actually know what to expect (a write) so we could maybe > optimize this some more with a dedicated function just for this. We don't know the addressing mode, the size or the source (immediate vs. register), so no. KVM is already doing a single translation and read no matter how long the instruction (unless it cross page boundaries). Paolo > > > > However, the eventfd is written before decoding, while full emulation > > would write it after. So while VCPU thread latency is worse compared to > > skip_emulated_instruction, latency to the iothread remains small. > > > > Paolo >
On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: > On 16/08/2017 15:16, Michael S. Tsirkin wrote: > > On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote: > >> On 16/08/2017 14:58, Michael S. Tsirkin wrote: > >>> On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote: > >>>> Microsoft pointed out privately to me that KVM's handling of > >>>> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid > >>>> in EPT misconfiguration vmexit handlers, because neither EPT violations > >>>> nor misconfigurations are listed in the manual among the VM exits that > >>>> set the VM-exit instruction length field. > >>>> > >>>> While physical processors seem to set the field, this is not architectural > >>>> and is just a side effect of the implementation. I couldn't convince > >>>> myself of any condition on the exit qualification where VM-exit > >>>> instruction length "has" to be defined; there are no trap-like VM-exits > >>>> that can be repurposed; and fault-like VM-exits such as descriptor-table > >>>> exits provide no decoding information. So I don't really see any elegant > >>>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio > >>>> 1 will go slower. > >>> > >>> How about I will try asking Intel about it? If they can commit to length > >>> being there in the future, we are all set. > >> > >> Nope, "I couldn't convince myself of any condition on the exit > >> qualification where VM-exit instruction length "has" to be defined". So > >> assuming Intel can do it, it would only apply to future processors (2 > >> years+ for server SKUs). > > > > Well maybe there's a reason it's actually working. Let's see what can > > be done. > > Sure there is. It just happens that the actual condition for VM-exit > instruction length being set correctly is "the fault was taken after the > accessing instruction has been decoded". But there's no way according > to the spec to detect whether that has happened. > > While you can filter out instruction fetches, that's not enough. A data > read could happen because someone pointed the IDT to MMIO area, and who > knows what the VM-exit instruction length points to in that case. Thinking more about it, I don't really see how anything legal guest might be doing with virtio would trigger anything but a fault after decoding the instruction. How does skipping instruction even make sense in the example you give? Please note that 1. we have verified the address and we know it's the one that belongs to MMIO of a virtio device 2. only write accesses are allowed by the eventfd or virtio spec guests doing anything else are on their own. Are there any reasonable examples on real hardware? Could you share this pls? > >> Plus of course it wouldn't be guaranteed to work on nested. > > > > Not sure I got this one. > > Not all nested hypervisors are setting the VM-exit instruction length > field on EPT violations, since it's documented not to be set. So that's probably the real issue - nested virt which has to do it in software at extra cost. We already limit this to intel processors, how about we blacklist nested virt for this optimization? I agree it's skating it a bit close to the dangerous edge, but so are other tricks we play with PTEs to speed up MMIO. > >>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical > >>>> address would do it, but it adds hypervisor knowledge in virtio, including > >>>> CPUID handling. > >>> > >>> Another issue is that it will break DPDK on virtio. > >> > >> Not break, just make it slower. > > > > I thought hypercalls can only be triggered from ring 0, userspace can't call them. > > Dod I get it wrong? > > That's just a limitation that KVM makes on currently-defined hypercalls. > > VMCALL causes a vmexit if executed from ring 3. > > Paolo
On 16/08/2017 18:50, Michael S. Tsirkin wrote: > On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: >> While you can filter out instruction fetches, that's not enough. A data >> read could happen because someone pointed the IDT to MMIO area, and who >> knows what the VM-exit instruction length points to in that case. > > Thinking more about it, I don't really see how anything > legal guest might be doing with virtio would trigger anything > but a fault after decoding the instruction. How does > skipping instruction even make sense in the example you give? There's no such thing as a legal guest. Anything that the hypervisor does, that differs from real hardware, is a possible escalation path. This in fact makes me doubt the EMULTYPE_SKIP patch too. >>>> Plus of course it wouldn't be guaranteed to work on nested. >>> >>> Not sure I got this one. >> >> Not all nested hypervisors are setting the VM-exit instruction length >> field on EPT violations, since it's documented not to be set. > > So that's probably the real issue - nested virt which has to do it > in software at extra cost. We already limit this to intel processors, > how about we blacklist nested virt for this optimization? > > I agree it's skating it a bit close to the dangerous edge, > but so are other tricks we play with PTEs to speed up MMIO. Not at all. Everything else we do is perfectly fine according to the spec, this one isn't. Paolo >>>>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical >>>>>> address would do it, but it adds hypervisor knowledge in virtio, including >>>>>> CPUID handling. >>>>> >>>>> Another issue is that it will break DPDK on virtio. >>>> >>>> Not break, just make it slower. >>> >>> I thought hypercalls can only be triggered from ring 0, userspace can't call them. >>> Dod I get it wrong? >> >> That's just a limitation that KVM makes on currently-defined hypercalls. >> >> VMCALL causes a vmexit if executed from ring 3. >> >> Paolo >
2017-08-16 19:19+0200, Paolo Bonzini: > On 16/08/2017 18:50, Michael S. Tsirkin wrote: >> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: >>> While you can filter out instruction fetches, that's not enough. A data >>> read could happen because someone pointed the IDT to MMIO area, and who >>> knows what the VM-exit instruction length points to in that case. >> >> Thinking more about it, I don't really see how anything >> legal guest might be doing with virtio would trigger anything >> but a fault after decoding the instruction. How does >> skipping instruction even make sense in the example you give? > > There's no such thing as a legal guest. Anything that the hypervisor > does, that differs from real hardware, is a possible escalation path. > > This in fact makes me doubt the EMULTYPE_SKIP patch too. The main hack is that we expect EPT misconfig within a given range to be a MMIO NULL write. I think it is fine -- EMULTYPE_SKIP is a common path that should have well tested error paths and, IIUC, virtio doesn't allow any other access, so it is a problem of the guest if a buggy/malicious application can access virtio memory. >>>>> Plus of course it wouldn't be guaranteed to work on nested. >>>> >>>> Not sure I got this one. >>> >>> Not all nested hypervisors are setting the VM-exit instruction length >>> field on EPT violations, since it's documented not to be set. >> >> So that's probably the real issue - nested virt which has to do it >> in software at extra cost. We already limit this to intel processors, Hm, there is no reason to exclude SVM. >> how about we blacklist nested virt for this optimization? Not every hypervisor can be easily detected ... KVM uses standard features and SDM clearly says that the instruction length field is undefined. We only lose performance if we decode the instruction, but piling workarounds creates unexpected corner cases. I still don't see acceptable alternatives to Paolo's solution.
On Wed, Aug 16, 2017 at 07:19:28PM +0200, Paolo Bonzini wrote: > On 16/08/2017 18:50, Michael S. Tsirkin wrote: > > On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: > >> While you can filter out instruction fetches, that's not enough. A data > >> read could happen because someone pointed the IDT to MMIO area, and who > >> knows what the VM-exit instruction length points to in that case. > > > > Thinking more about it, I don't really see how anything > > legal guest might be doing with virtio would trigger anything > > but a fault after decoding the instruction. How does > > skipping instruction even make sense in the example you give? > > There's no such thing as a legal guest. Anything that the hypervisor > does, that differs from real hardware, is a possible escalation path. Fast MMIO bus devices don't apprear out of thin air. They appear because guest enabled a virtio device. So it is a PV guest and if it doesn't behave according to the virtio spec, it is going to crash. > > This in fact makes me doubt the EMULTYPE_SKIP patch too. > > >>>> Plus of course it wouldn't be guaranteed to work on nested. > >>> > >>> Not sure I got this one. > >> > >> Not all nested hypervisors are setting the VM-exit instruction length > >> field on EPT violations, since it's documented not to be set. > > > > So that's probably the real issue - nested virt which has to do it > > in software at extra cost. We already limit this to intel processors, > > how about we blacklist nested virt for this optimization? > > > > I agree it's skating it a bit close to the dangerous edge, > > but so are other tricks we play with PTEs to speed up MMIO. > > Not at all. Everything else we do is perfectly fine according to the > spec, this one isn't. > > Paolo Virtio MMIO is kind of special in many ways. What happens if I map and try to execute an MMIO BAR? I don't think it will work, will it? > >>>>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical > >>>>>> address would do it, but it adds hypervisor knowledge in virtio, including > >>>>>> CPUID handling. > >>>>> > >>>>> Another issue is that it will break DPDK on virtio. > >>>> > >>>> Not break, just make it slower. > >>> > >>> I thought hypercalls can only be triggered from ring 0, userspace can't call them. > >>> Dod I get it wrong? > >> > >> That's just a limitation that KVM makes on currently-defined hypercalls. > >> > >> VMCALL causes a vmexit if executed from ring 3. > >> > >> Paolo > >
On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote: > 2017-08-16 19:19+0200, Paolo Bonzini: > > On 16/08/2017 18:50, Michael S. Tsirkin wrote: > >> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: > >>> While you can filter out instruction fetches, that's not enough. A data > >>> read could happen because someone pointed the IDT to MMIO area, and who > >>> knows what the VM-exit instruction length points to in that case. > >> > >> Thinking more about it, I don't really see how anything > >> legal guest might be doing with virtio would trigger anything > >> but a fault after decoding the instruction. How does > >> skipping instruction even make sense in the example you give? > > > > There's no such thing as a legal guest. Anything that the hypervisor > > does, that differs from real hardware, is a possible escalation path. > > > > This in fact makes me doubt the EMULTYPE_SKIP patch too. > > The main hack is that we expect EPT misconfig within a given range to be > a MMIO NULL write. I think it is fine -- EMULTYPE_SKIP is a common path > that should have well tested error paths and, IIUC, virtio doesn't allow > any other access, so it is a problem of the guest if a buggy/malicious > application can access virtio memory. > > >>>>> Plus of course it wouldn't be guaranteed to work on nested. > >>>> > >>>> Not sure I got this one. > >>> > >>> Not all nested hypervisors are setting the VM-exit instruction length > >>> field on EPT violations, since it's documented not to be set. > >> > >> So that's probably the real issue - nested virt which has to do it > >> in software at extra cost. We already limit this to intel processors, > > Hm, there is no reason to exclude SVM. > > >> how about we blacklist nested virt for this optimization? > > Not every hypervisor can be easily detected ... Hypervisors that don't set a hypervisor bit in CPUID are violating the spec themselves, aren't they? Anyway, we can add a management option for use in a nested scenario. > KVM uses standard > features and SDM clearly says that the instruction length field is > undefined. True. Let's see whether intel can commit to a stronger definition. I don't think there's any rush to make this change. > We only lose performance if we decode the instruction, but piling > workarounds creates unexpected corner cases. > > I still don't see acceptable alternatives to Paolo's solution. It's just that this has been there for 3 years and people have built a product around this. It's not a feature you can discard out of hand out of theoretical concerns or to improve niche use-cases such as nested virt.
On 16/08/2017 21:59, Michael S. Tsirkin wrote: > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote: >> 2017-08-16 19:19+0200, Paolo Bonzini: >>> On 16/08/2017 18:50, Michael S. Tsirkin wrote: >>>> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: >>>>> While you can filter out instruction fetches, that's not enough. A data >>>>> read could happen because someone pointed the IDT to MMIO area, and who >>>>> knows what the VM-exit instruction length points to in that case. >>>> >>>> Thinking more about it, I don't really see how anything >>>> legal guest might be doing with virtio would trigger anything >>>> but a fault after decoding the instruction. How does >>>> skipping instruction even make sense in the example you give? >>> >>> There's no such thing as a legal guest. Anything that the hypervisor >>> does, that differs from real hardware, is a possible escalation path. >>> >>> This in fact makes me doubt the EMULTYPE_SKIP patch too. >> >> The main hack is that we expect EPT misconfig within a given range to be >> a MMIO NULL write. I think it is fine -- EMULTYPE_SKIP is a common path >> that should have well tested error paths and, IIUC, virtio doesn't allow >> any other access, so it is a problem of the guest if a buggy/malicious >> application can access virtio memory. Yes, I agree. EMULTYPE_SKIP is fine because failed decoding still causes an exception to be injected. Maybe it's better to gate the EMULTYPE_SKIP emulation on the exit qualification saying this is a write and also not a page table walk---just in case. >>>> how about we blacklist nested virt for this optimization? >> >> Not every hypervisor can be easily detected ... > > Hypervisors that don't set a hypervisor bit in CPUID are violating the > spec themselves, aren't they? Anyway, we can add a management option > for use in a nested scenario. No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined. See for example https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458: "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF for software use. Hypervisors can use these leaves to provide an interface to pass information from the hypervisor to the guest operating system running inside a virtual machine. The hypervisor bit indicates the presence of a hypervisor and that it is safe to test these additional software leaves". >> KVM uses standard features and SDM clearly says that the >> instruction length field is undefined. > > True. Let's see whether intel can commit to a stronger definition. > I don't think there's any rush to make this change. I disagree. Relying on undefined processor features is a bad idea. > It's just that this has been there for 3 years and people have built a > product around this. Around 700 clock cycles? Paolo
On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote: > On 16/08/2017 21:59, Michael S. Tsirkin wrote: > > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote: > >> 2017-08-16 19:19+0200, Paolo Bonzini: > >>> On 16/08/2017 18:50, Michael S. Tsirkin wrote: > >>>> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote: > >>>>> While you can filter out instruction fetches, that's not enough. A data > >>>>> read could happen because someone pointed the IDT to MMIO area, and who > >>>>> knows what the VM-exit instruction length points to in that case. > >>>> > >>>> Thinking more about it, I don't really see how anything > >>>> legal guest might be doing with virtio would trigger anything > >>>> but a fault after decoding the instruction. How does > >>>> skipping instruction even make sense in the example you give? > >>> > >>> There's no such thing as a legal guest. Anything that the hypervisor > >>> does, that differs from real hardware, is a possible escalation path. > >>> > >>> This in fact makes me doubt the EMULTYPE_SKIP patch too. > >> > >> The main hack is that we expect EPT misconfig within a given range to be > >> a MMIO NULL write. I think it is fine -- EMULTYPE_SKIP is a common path > >> that should have well tested error paths and, IIUC, virtio doesn't allow > >> any other access, so it is a problem of the guest if a buggy/malicious > >> application can access virtio memory. > > Yes, I agree. EMULTYPE_SKIP is fine because failed decoding still > causes an exception to be injected. Maybe it's better to gate the > EMULTYPE_SKIP emulation on the exit qualification saying this is a write I thought it's already limited to writes. I agree that's a reasonable limitation in any case. > and also not a page table walk---just in case. I still don't get it, sorry. Let's assume for the sake of argument that it's a PT walk causing the MMIO access. Just why do you think that it makes sense to skip the instruction that caused the walk? > >>>> how about we blacklist nested virt for this optimization? > >> > >> Not every hypervisor can be easily detected ... > > > > Hypervisors that don't set a hypervisor bit in CPUID are violating the > > spec themselves, aren't they? Anyway, we can add a management option > > for use in a nested scenario. > > No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined. > See for example > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458: > "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF > for software use. Hypervisors can use these leaves to provide an > interface to pass information from the hypervisor to the guest operating > system running inside a virtual machine. The hypervisor bit indicates > the presence of a hypervisor and that it is safe to test these > additional software leaves". Looks like it's not a bug then. Still, most hypervisors do have this leaf so it's a reasonable way that will catch most issues. We can always blacklist more as they are found. Additionally let's go ahead and add ability for userspace to disable fast MMIO for these hypervisors we failed to detect. > >> KVM uses standard features and SDM clearly says that the > >> instruction length field is undefined. > > > > True. Let's see whether intel can commit to a stronger definition. > > I don't think there's any rush to make this change. > > I disagree. Relying on undefined processor features is a bad idea. Maybe it was a bad idea 3 years ago, yes. In 2012 I posted "kvm_para: add mmio word store hypercall" as an alternative. Was nacked as MMIO was seen as safer and better. By now many people rely on mmio being fast. Let's talk to hardware guys to define the feature before we give up and spend years designing an alternative. > > It's just that this has been there for 3 years and people have built a > > product around this. > > Around 700 clock cycles? > > Paolo About 30% the cost of exit, isn't it? There are definitely workloads where cost of exit gates performance. We didn't work on fast mmio based on theoretical assumptions. But maybe I am wrong. We'll see. Jason here volunteered to test your patch and we'll see what comes out of it. If I'm wrong and it's about 1%, I won't split hairs.
On 16.08.2017 14:07, Radim Krčmář wrote: > 2017-08-16 13:22+0200, Paolo Bonzini: >> Microsoft pointed out privately to me that KVM's handling of >> KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid >> in EPT misconfiguration vmexit handlers, because neither EPT violations >> nor misconfigurations are listed in the manual among the VM exits that >> set the VM-exit instruction length field. >> >> While physical processors seem to set the field, this is not architectural >> and is just a side effect of the implementation. I couldn't convince >> myself of any condition on the exit qualification where VM-exit >> instruction length "has" to be defined; there are no trap-like VM-exits >> that can be repurposed; and fault-like VM-exits such as descriptor-table >> exits provide no decoding information. So I don't really see any elegant >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio >> 1 will go slower. > > Do you have some numbers? > > We could keep the ugliness in KVM and add a new skip function with > emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the > instruction. (Adding a condition just for EPT violation exit reason to I like that idea. > the existing skip function would be a dirtier solution.) > Slower than what we have now, but faster than full emulation. > > I agree that configuring EPT to throw a violation when accessing fast > MMIO has many drawbacks (although it seems to be what Intel expected). > > Thanks.
On 17/08/2017 00:31, Michael S. Tsirkin wrote: > On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote: >> Yes, I agree. EMULTYPE_SKIP is fine because failed decoding still >> causes an exception to be injected. Maybe it's better to gate the >> EMULTYPE_SKIP emulation on the exit qualification saying this is a write > > I thought it's already limited to writes. I agree that's a reasonable > limitation in any case. > >> and also not a page table walk---just in case. > > I still don't get it, sorry. Let's assume for the sake of argument > that it's a PT walk causing the MMIO access. Just why do you think > that it makes sense to skip the instruction that caused the walk? I think it doesn't. I think in that case it's better to skip the fast write and proceed with full emulation. >>> It's just that this has been there for 3 years and people have built a >>> product around this. >> >> Around 700 clock cycles? > > About 30% the cost of exit, isn't it? There are definitely workloads > where cost of exit gates performance. We didn't work on fast mmio based > on theoretical assumptions. But maybe I am wrong. We'll see. Jason here > volunteered to test your patch and we'll see what comes out of it. If > I'm wrong and it's about 1%, I won't split hairs. Note that we still get the latency benefit from fast MMIO, and maybe we can cut a couple hundred clock cycles more---which would benefit all emulation, not just fast MMIO. Paolo
On 17/08/2017 11:00, Paolo Bonzini wrote: > On 17/08/2017 00:31, Michael S. Tsirkin wrote: >> On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote: >>> Yes, I agree. EMULTYPE_SKIP is fine because failed decoding still >>> causes an exception to be injected. Maybe it's better to gate the >>> EMULTYPE_SKIP emulation on the exit qualification saying this is a write >> >> I thought it's already limited to writes. I agree that's a reasonable >> limitation in any case. >> >>> and also not a page table walk---just in case. >> >> I still don't get it, sorry. Let's assume for the sake of argument >> that it's a PT walk causing the MMIO access. Just why do you think >> that it makes sense to skip the instruction that caused the walk? > > I think it doesn't. I think in that case it's better to skip the fast > write and proceed with full emulation. ... nope, exit qualification is just zero for EPT misconfigurations, so we cannot do this. Paolo
2017-08-17 14:14+0200, Paolo Bonzini: > On 17/08/2017 11:00, Paolo Bonzini wrote: > > On 17/08/2017 00:31, Michael S. Tsirkin wrote: > >> On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote: > >>> Yes, I agree. EMULTYPE_SKIP is fine because failed decoding still > >>> causes an exception to be injected. Maybe it's better to gate the > >>> EMULTYPE_SKIP emulation on the exit qualification saying this is a write > >> > >> I thought it's already limited to writes. I agree that's a reasonable > >> limitation in any case. > >> > >>> and also not a page table walk---just in case. > >> > >> I still don't get it, sorry. Let's assume for the sake of argument > >> that it's a PT walk causing the MMIO access. Just why do you think > >> that it makes sense to skip the instruction that caused the walk? > > > > I think it doesn't. I think in that case it's better to skip the fast > > write and proceed with full emulation. It definitely doesn't. I think it's possible only because virtio forbids other access types, so we assume that every access is a write access. > ... nope, exit qualification is just zero for EPT misconfigurations, so > we cannot do this. Yeah, we'd need to use EPT violation to get that information.
2017-08-17 01:31+0300, Michael S. Tsirkin: > On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote: > > On 16/08/2017 21:59, Michael S. Tsirkin wrote: > > > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote: > > >>>> how about we blacklist nested virt for this optimization? > > >> > > >> Not every hypervisor can be easily detected ... > > > > > > Hypervisors that don't set a hypervisor bit in CPUID are violating the > > > spec themselves, aren't they? Anyway, we can add a management option > > > for use in a nested scenario. > > > > No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined. > > See for example > > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458: > > "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF > > for software use. Hypervisors can use these leaves to provide an > > interface to pass information from the hypervisor to the guest operating > > system running inside a virtual machine. The hypervisor bit indicates > > the presence of a hypervisor and that it is safe to test these > > additional software leaves". > > Looks like it's not a bug then. Still, most hypervisors do have this > leaf so it's a reasonable way that will catch most issues. We can > always blacklist more as they are found. Additionally let's go ahead > and add ability for userspace to disable fast MMIO for these > hypervisors we failed to detect. In the worst case, I'd make faster mmio an opt-in unsafe feature regardless of what we run on. Users that just want KVM to work get the default and people who care about utmost performance can jump through loops. > > >> KVM uses standard features and SDM clearly says that the > > >> instruction length field is undefined. > > > > > > True. Let's see whether intel can commit to a stronger definition. > > > I don't think there's any rush to make this change. > > > > I disagree. Relying on undefined processor features is a bad idea. > > Maybe it was a bad idea 3 years ago, yes. In 2012 I posted "kvm_para: > add mmio word store hypercall" as an alternative. Was nacked as MMIO > was seen as safer and better. By now many people rely on mmio being > fast. Let's talk to hardware guys to define the feature before we give > up and spend years designing an alternative. The change is not backward-compatible wrt. SDM, but all processors might actually be behaving like we want ... (I'd assert undefined behavior add a vm-exit flag if I were to allow it, though.) > > > It's just that this has been there for 3 years and people have built a > > > product around this. > > > > Around 700 clock cycles? > > > > Paolo > > About 30% the cost of exit, isn't it? There are definitely workloads > where cost of exit gates performance. We didn't work on fast mmio based > on theoretical assumptions. But maybe I am wrong. We'll see. Jason here > volunteered to test your patch and we'll see what comes out of it. If > I'm wrong and it's about 1%, I won't split hairs. I'm ok with waiting for the numbers as I hope that we won't have to resort to adding special cases.
On Thu, Aug 17, 2017 at 11:00:01AM +0200, Paolo Bonzini wrote: > >> and also not a page table walk---just in case. > > > > I still don't get it, sorry. Let's assume for the sake of argument > > that it's a PT walk causing the MMIO access. Just why do you think > > that it makes sense to skip the instruction that caused the walk? > > I think it doesn't. I think in that case it's better to skip the fast > write and proceed with full emulation. Right. So my argument is that fast mmio is used for paravirtualization exclusively. Thus someone doing anything that isn't a memory write to trigger it gets to keep both pieces. I get it that even writes don't work in some nested virt setups, and this seems worth fixing, and I get it that we have inadvertently relied on an undocumented behaviour. I would like to understand whether you believe that on bare metal and when accessing MMIO using writes exclusively, and after checking the address matches one of the paravirtualized device, there's still a real chance length is invalid, or is it more a question of missing documentation.
On Thu, Aug 17, 2017 at 03:51:31PM +0200, Radim Krčmář wrote: > 2017-08-17 01:31+0300, Michael S. Tsirkin: > > On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote: > > > On 16/08/2017 21:59, Michael S. Tsirkin wrote: > > > > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote: > > > >>>> how about we blacklist nested virt for this optimization? > > > >> > > > >> Not every hypervisor can be easily detected ... > > > > > > > > Hypervisors that don't set a hypervisor bit in CPUID are violating the > > > > spec themselves, aren't they? Anyway, we can add a management option > > > > for use in a nested scenario. > > > > > > No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined. > > > See for example > > > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458: > > > "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF > > > for software use. Hypervisors can use these leaves to provide an > > > interface to pass information from the hypervisor to the guest operating > > > system running inside a virtual machine. The hypervisor bit indicates > > > the presence of a hypervisor and that it is safe to test these > > > additional software leaves". > > > > Looks like it's not a bug then. Still, most hypervisors do have this > > leaf so it's a reasonable way that will catch most issues. We can > > always blacklist more as they are found. Additionally let's go ahead > > and add ability for userspace to disable fast MMIO for these > > hypervisors we failed to detect. > > In the worst case, I'd make faster mmio an opt-in unsafe feature > regardless of what we run on. Users that just want KVM to work get the > default and people who care about utmost performance can jump through > loops. Might have been reasonable originally but you do not want to slow down everyone and then make them jump through hoops just to get back where they were originally. So I think it has to be opt-out, really for nested setups at this point. > > > >> KVM uses standard features and SDM clearly says that the > > > >> instruction length field is undefined. > > > > > > > > True. Let's see whether intel can commit to a stronger definition. > > > > I don't think there's any rush to make this change. > > > > > > I disagree. Relying on undefined processor features is a bad idea. > > > > Maybe it was a bad idea 3 years ago, yes. In 2012 I posted "kvm_para: > > add mmio word store hypercall" as an alternative. Was nacked as MMIO > > was seen as safer and better. By now many people rely on mmio being > > fast. Let's talk to hardware guys to define the feature before we give > > up and spend years designing an alternative. > > The change is not backward-compatible wrt. SDM, but all processors might > actually be behaving like we want ... (I'd assert undefined behavior > add a vm-exit flag if I were to allow it, though.) > > > > > It's just that this has been there for 3 years and people have built a > > > > product around this. > > > > > > Around 700 clock cycles? > > > > > > Paolo > > > > About 30% the cost of exit, isn't it? There are definitely workloads > > where cost of exit gates performance. We didn't work on fast mmio based > > on theoretical assumptions. But maybe I am wrong. We'll see. Jason here > > volunteered to test your patch and we'll see what comes out of it. If > > I'm wrong and it's about 1%, I won't split hairs. > > I'm ok with waiting for the numbers as I hope that we won't have to > resort to adding special cases.
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 375dca24cf42..b3eaeb20670d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) gpa_t gpa; gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); - if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) { - trace_kvm_fast_mmio(gpa); - return kvm_skip_emulated_instruction(vcpu); - } - ret = handle_mmio_page_fault(vcpu, gpa, true); vcpu->arch.gpa_available = true; if (likely(ret == RET_MMIO_PF_EMULATE))
Microsoft pointed out privately to me that KVM's handling of KVM_FAST_MMIO_BUS is invalid. Using skip_emulation_instruction is invalid in EPT misconfiguration vmexit handlers, because neither EPT violations nor misconfigurations are listed in the manual among the VM exits that set the VM-exit instruction length field. While physical processors seem to set the field, this is not architectural and is just a side effect of the implementation. I couldn't convince myself of any condition on the exit qualification where VM-exit instruction length "has" to be defined; there are no trap-like VM-exits that can be repurposed; and fault-like VM-exits such as descriptor-table exits provide no decoding information. So I don't really see any elegant way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio 1 will go slower. Adding a hypercall or MSR write that does a fast MMIO write to a physical address would do it, but it adds hypervisor knowledge in virtio, including CPUID handling. So it would be pretty ugly in the guest-side implementation, but if somebody wants to do it and the virtio side is acceptable to the virtio maintainers, I am okay with it. Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/kvm/vmx.c | 5 ----- 1 file changed, 5 deletions(-)