kvm: x86: disable KVM_FAST_MMIO_BUS

Message ID	20170816112249.28939-1-pbonzini@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Paolo Bonzini <pbonzini@redhat.com> To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: "Michael S . Tsirkin" <mst@redhat.com>, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?= <rkrcmar@redhat.com>, stable@vger.kernel.org Subject: [PATCH] kvm: x86: disable KVM_FAST_MMIO_BUS Date: Wed, 16 Aug 2017 13:22:49 +0200 Message-Id: <20170816112249.28939-1-pbonzini@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: kvm-owner@vger.kernel.org Precedence: bulk

Paolo Bonzini Aug. 16, 2017, 11:22 a.m. UTC

Microsoft pointed out privately to me that KVM's handling of
KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
in EPT misconfiguration vmexit handlers, because neither EPT violations
nor misconfigurations are listed in the manual among the VM exits that
set the VM-exit instruction length field.

While physical processors seem to set the field, this is not architectural
and is just a side effect of the implementation.  I couldn't convince
myself of any condition on the exit qualification where VM-exit
instruction length "has" to be defined; there are no trap-like VM-exits
that can be repurposed; and fault-like VM-exits such as descriptor-table
exits provide no decoding information.  So I don't really see any elegant
way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
1 will go slower.

Adding a hypercall or MSR write that does a fast MMIO write to a physical
address would do it, but it adds hypervisor knowledge in virtio, including
CPUID handling.  So it would be pretty ugly in the guest-side implementation,
but if somebody wants to do it and the virtio side is acceptable to the
virtio maintainers, I am okay with it.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx.c | 5 -----
 1 file changed, 5 deletions(-)

Radim Krčmář Aug. 16, 2017, 12:07 p.m. UTC | #1

2017-08-16 13:22+0200, Paolo Bonzini:
> Microsoft pointed out privately to me that KVM's handling of
> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
> in EPT misconfiguration vmexit handlers, because neither EPT violations
> nor misconfigurations are listed in the manual among the VM exits that
> set the VM-exit instruction length field.
> 
> While physical processors seem to set the field, this is not architectural
> and is just a side effect of the implementation.  I couldn't convince
> myself of any condition on the exit qualification where VM-exit
> instruction length "has" to be defined; there are no trap-like VM-exits
> that can be repurposed; and fault-like VM-exits such as descriptor-table
> exits provide no decoding information.  So I don't really see any elegant
> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> 1 will go slower.

Do you have some numbers?

We could keep the ugliness in KVM and add a new skip function with
emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the
instruction.  (Adding a condition just for EPT violation exit reason to
the existing skip function would be a dirtier solution.)
Slower than what we have now, but faster than full emulation.

I agree that configuring EPT to throw a violation when accessing fast
MMIO has many drawbacks (although it seems to be what Intel expected).

Thanks.

> Adding a hypercall or MSR write that does a fast MMIO write to a physical
> address would do it, but it adds hypervisor knowledge in virtio, including
> CPUID handling.  So it would be pretty ugly in the guest-side implementation,
> but if somebody wants to do it and the virtio side is acceptable to the
> virtio maintainers, I am okay with it.
> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/vmx.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 375dca24cf42..b3eaeb20670d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>  	gpa_t gpa;
>  
>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> -	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> -		trace_kvm_fast_mmio(gpa);
> -		return kvm_skip_emulated_instruction(vcpu);
> -	}
> -
>  	ret = handle_mmio_page_fault(vcpu, gpa, true);
>  	vcpu->arch.gpa_available = true;
>  	if (likely(ret == RET_MMIO_PF_EMULATE))
> -- 
> 2.13.5
>

Michael S. Tsirkin Aug. 16, 2017, 12:58 p.m. UTC | #2

On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote:
> Microsoft pointed out privately to me that KVM's handling of
> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
> in EPT misconfiguration vmexit handlers, because neither EPT violations
> nor misconfigurations are listed in the manual among the VM exits that
> set the VM-exit instruction length field.
> 
> While physical processors seem to set the field, this is not architectural
> and is just a side effect of the implementation.  I couldn't convince
> myself of any condition on the exit qualification where VM-exit
> instruction length "has" to be defined; there are no trap-like VM-exits
> that can be repurposed; and fault-like VM-exits such as descriptor-table
> exits provide no decoding information.  So I don't really see any elegant
> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> 1 will go slower.

How about I will try asking Intel about it? If they can commit to length
being there in the future, we are all set.

> Adding a hypercall or MSR write that does a fast MMIO write to a physical
> address would do it, but it adds hypervisor knowledge in virtio, including
> CPUID handling.

Another issue is that it will break DPDK on virtio.

>  So it would be pretty ugly in the guest-side implementation,
> but if somebody wants to do it and the virtio side is acceptable to the
> virtio maintainers, I am okay with it.
> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Hmm that's quite unfortunate as we have just completed rolling out MMIO
signalling across the board. We did measure a significant slowdown
before enabling fast mmio. 

    Guest TX:(TCP)
    size/session/+throughput%/+cpu%/-+per cpu%/
    64/1/[+18.9183%]/-0.2823%/[+19.2550%]/
    64/2/[+13.5714%]/[+2.2675%]/[+11.0533%]/
    64/4/[+13.1070%]/[+2.1817%]/[+10.6920%]/
    64/8/[+13.0426%]/[+2.0887%]/[+10.7299%]/
    256/1/[+36.2761%]/+6.3434%/[+28.1471%]/
    ...
    1024/1/[+44.8873%]/+2.0811%/[+41.9335%]/
    ...
    1024/4/+0.0228%/[-2.2044%]/[+2.2774%]/
    ...
    16384/2/+0.0127%/[-5.0346%]/[+5.3148%]/
    ...
    65535/1/[+0.0062%]/[-4.1183%]/[+4.3017%]/
    65535/2/+0.0004%/[-4.2311%]/[+4.4185%]/
    65535/4/+0.0107%/[-4.6106%]/[+4.8446%]/
    65535/8/-0.0090%/[-5.5178%]/[+5.8306%]/
    

See commit bc85ccfdf5cc045588f665c84b5707d7364c8a6c for more numbers.


> ---
>  arch/x86/kvm/vmx.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 375dca24cf42..b3eaeb20670d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>  	gpa_t gpa;
>  
>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> -	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> -		trace_kvm_fast_mmio(gpa);
> -		return kvm_skip_emulated_instruction(vcpu);
> -	}
> -
>  	ret = handle_mmio_page_fault(vcpu, gpa, true);
>  	vcpu->arch.gpa_available = true;
>  	if (likely(ret == RET_MMIO_PF_EMULATE))
> -- 
> 2.13.5

Paolo Bonzini Aug. 16, 2017, 1:05 p.m. UTC | #3

On 16/08/2017 14:58, Michael S. Tsirkin wrote:
> On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote:
>> Microsoft pointed out privately to me that KVM's handling of
>> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
>> in EPT misconfiguration vmexit handlers, because neither EPT violations
>> nor misconfigurations are listed in the manual among the VM exits that
>> set the VM-exit instruction length field.
>>
>> While physical processors seem to set the field, this is not architectural
>> and is just a side effect of the implementation.  I couldn't convince
>> myself of any condition on the exit qualification where VM-exit
>> instruction length "has" to be defined; there are no trap-like VM-exits
>> that can be repurposed; and fault-like VM-exits such as descriptor-table
>> exits provide no decoding information.  So I don't really see any elegant
>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
>> 1 will go slower.
> 
> How about I will try asking Intel about it? If they can commit to length
> being there in the future, we are all set.

Nope, "I couldn't convince myself of any condition on the exit
qualification where VM-exit instruction length "has" to be defined".  So
assuming Intel can do it, it would only apply to future processors (2
years+ for server SKUs).

Plus of course it wouldn't be guaranteed to work on nested.

>> Adding a hypercall or MSR write that does a fast MMIO write to a physical
>> address would do it, but it adds hypervisor knowledge in virtio, including
>> CPUID handling.
> 
> Another issue is that it will break DPDK on virtio.

Not break, just make it slower.

Paolo

> Hmm that's quite unfortunate as we have just completed rolling out MMIO
> signalling across the board. We did measure a significant slowdown
> before enabling fast mmio. 
> 
>     Guest TX:(TCP)
>     size/session/+throughput%/+cpu%/-+per cpu%/
>     64/1/[+18.9183%]/-0.2823%/[+19.2550%]/
>     64/2/[+13.5714%]/[+2.2675%]/[+11.0533%]/
>     64/4/[+13.1070%]/[+2.1817%]/[+10.6920%]/
>     64/8/[+13.0426%]/[+2.0887%]/[+10.7299%]/
>     256/1/[+36.2761%]/+6.3434%/[+28.1471%]/
>     ...
>     1024/1/[+44.8873%]/+2.0811%/[+41.9335%]/
>     ...
>     1024/4/+0.0228%/[-2.2044%]/[+2.2774%]/
>     ...
>     16384/2/+0.0127%/[-5.0346%]/[+5.3148%]/
>     ...
>     65535/1/[+0.0062%]/[-4.1183%]/[+4.3017%]/
>     65535/2/+0.0004%/[-4.2311%]/[+4.4185%]/
>     65535/4/+0.0107%/[-4.6106%]/[+4.8446%]/
>     65535/8/-0.0090%/[-5.5178%]/[+5.8306%]/
>     
> 
> See commit bc85ccfdf5cc045588f665c84b5707d7364c8a6c for more numbers.
> 
> 
>> ---
>>  arch/x86/kvm/vmx.c | 5 -----
>>  1 file changed, 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 375dca24cf42..b3eaeb20670d 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>>  	gpa_t gpa;
>>  
>>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>> -	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
>> -		trace_kvm_fast_mmio(gpa);
>> -		return kvm_skip_emulated_instruction(vcpu);
>> -	}
>> -
>>  	ret = handle_mmio_page_fault(vcpu, gpa, true);
>>  	vcpu->arch.gpa_available = true;
>>  	if (likely(ret == RET_MMIO_PF_EMULATE))
>> -- 
>> 2.13.5

Michael S. Tsirkin Aug. 16, 2017, 1:16 p.m. UTC | #4

On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 14:58, Michael S. Tsirkin wrote:
> > On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote:
> >> Microsoft pointed out privately to me that KVM's handling of
> >> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
> >> in EPT misconfiguration vmexit handlers, because neither EPT violations
> >> nor misconfigurations are listed in the manual among the VM exits that
> >> set the VM-exit instruction length field.
> >>
> >> While physical processors seem to set the field, this is not architectural
> >> and is just a side effect of the implementation.  I couldn't convince
> >> myself of any condition on the exit qualification where VM-exit
> >> instruction length "has" to be defined; there are no trap-like VM-exits
> >> that can be repurposed; and fault-like VM-exits such as descriptor-table
> >> exits provide no decoding information.  So I don't really see any elegant
> >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> >> 1 will go slower.
> > 
> > How about I will try asking Intel about it? If they can commit to length
> > being there in the future, we are all set.
> 
> Nope, "I couldn't convince myself of any condition on the exit
> qualification where VM-exit instruction length "has" to be defined".  So
> assuming Intel can do it, it would only apply to future processors (2
> years+ for server SKUs).

Well maybe there's a reason it's actually working.  Let's see what can
be done.

> 
> Plus of course it wouldn't be guaranteed to work on nested.

Not sure I got this one.

> >> Adding a hypercall or MSR write that does a fast MMIO write to a physical
> >> address would do it, but it adds hypervisor knowledge in virtio, including
> >> CPUID handling.
> > 
> > Another issue is that it will break DPDK on virtio.
> 
> Not break, just make it slower.

I thought hypercalls can only be triggered from ring 0, userspace can't call them.
Dod I get it wrong?

> Paolo
> 
> > Hmm that's quite unfortunate as we have just completed rolling out MMIO
> > signalling across the board. We did measure a significant slowdown
> > before enabling fast mmio. 
> > 
> >     Guest TX:(TCP)
> >     size/session/+throughput%/+cpu%/-+per cpu%/
> >     64/1/[+18.9183%]/-0.2823%/[+19.2550%]/
> >     64/2/[+13.5714%]/[+2.2675%]/[+11.0533%]/
> >     64/4/[+13.1070%]/[+2.1817%]/[+10.6920%]/
> >     64/8/[+13.0426%]/[+2.0887%]/[+10.7299%]/
> >     256/1/[+36.2761%]/+6.3434%/[+28.1471%]/
> >     ...
> >     1024/1/[+44.8873%]/+2.0811%/[+41.9335%]/
> >     ...
> >     1024/4/+0.0228%/[-2.2044%]/[+2.2774%]/
> >     ...
> >     16384/2/+0.0127%/[-5.0346%]/[+5.3148%]/
> >     ...
> >     65535/1/[+0.0062%]/[-4.1183%]/[+4.3017%]/
> >     65535/2/+0.0004%/[-4.2311%]/[+4.4185%]/
> >     65535/4/+0.0107%/[-4.6106%]/[+4.8446%]/
> >     65535/8/-0.0090%/[-5.5178%]/[+5.8306%]/
> >     
> > 
> > See commit bc85ccfdf5cc045588f665c84b5707d7364c8a6c for more numbers.
> > 
> > 
> >> ---
> >>  arch/x86/kvm/vmx.c | 5 -----
> >>  1 file changed, 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 375dca24cf42..b3eaeb20670d 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -6320,11 +6320,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> >>  	gpa_t gpa;
> >>  
> >>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> >> -	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> >> -		trace_kvm_fast_mmio(gpa);
> >> -		return kvm_skip_emulated_instruction(vcpu);
> >> -	}
> >> -
> >>  	ret = handle_mmio_page_fault(vcpu, gpa, true);
> >>  	vcpu->arch.gpa_available = true;
> >>  	if (likely(ret == RET_MMIO_PF_EMULATE))
> >> -- 
> >> 2.13.5

Paolo Bonzini Aug. 16, 2017, 1:30 p.m. UTC | #5

On 16/08/2017 15:16, Michael S. Tsirkin wrote:
> On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote:
>> On 16/08/2017 14:58, Michael S. Tsirkin wrote:
>>> On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote:
>>>> Microsoft pointed out privately to me that KVM's handling of
>>>> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
>>>> in EPT misconfiguration vmexit handlers, because neither EPT violations
>>>> nor misconfigurations are listed in the manual among the VM exits that
>>>> set the VM-exit instruction length field.
>>>>
>>>> While physical processors seem to set the field, this is not architectural
>>>> and is just a side effect of the implementation.  I couldn't convince
>>>> myself of any condition on the exit qualification where VM-exit
>>>> instruction length "has" to be defined; there are no trap-like VM-exits
>>>> that can be repurposed; and fault-like VM-exits such as descriptor-table
>>>> exits provide no decoding information.  So I don't really see any elegant
>>>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
>>>> 1 will go slower.
>>>
>>> How about I will try asking Intel about it? If they can commit to length
>>> being there in the future, we are all set.
>>
>> Nope, "I couldn't convince myself of any condition on the exit
>> qualification where VM-exit instruction length "has" to be defined".  So
>> assuming Intel can do it, it would only apply to future processors (2
>> years+ for server SKUs).
> 
> Well maybe there's a reason it's actually working.  Let's see what can
> be done.

Sure there is.  It just happens that the actual condition for VM-exit
instruction length being set correctly is "the fault was taken after the
accessing instruction has been decoded".  But there's no way according
to the spec to detect whether that has happened.

While you can filter out instruction fetches, that's not enough.  A data
read could happen because someone pointed the IDT to MMIO area, and who
knows what the VM-exit instruction length points to in that case.

>> Plus of course it wouldn't be guaranteed to work on nested.
> 
> Not sure I got this one.

Not all nested hypervisors are setting the VM-exit instruction length
field on EPT violations, since it's documented not to be set.

>>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical
>>>> address would do it, but it adds hypervisor knowledge in virtio, including
>>>> CPUID handling.
>>>
>>> Another issue is that it will break DPDK on virtio.
>>
>> Not break, just make it slower.
> 
> I thought hypercalls can only be triggered from ring 0, userspace can't call them.
> Dod I get it wrong?

That's just a limitation that KVM makes on currently-defined hypercalls.

VMCALL causes a vmexit if executed from ring 3.

Paolo

Paolo Bonzini Aug. 16, 2017, 1:37 p.m. UTC | #6

On 16/08/2017 14:07, Radim Krčmář wrote:
> 2017-08-16 13:22+0200, Paolo Bonzini:
>> Microsoft pointed out privately to me that KVM's handling of
>> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
>> in EPT misconfiguration vmexit handlers, because neither EPT violations
>> nor misconfigurations are listed in the manual among the VM exits that
>> set the VM-exit instruction length field.
>>
>> While physical processors seem to set the field, this is not architectural
>> and is just a side effect of the implementation.  I couldn't convince
>> myself of any condition on the exit qualification where VM-exit
>> instruction length "has" to be defined; there are no trap-like VM-exits
>> that can be repurposed; and fault-like VM-exits such as descriptor-table
>> exits provide no decoding information.  So I don't really see any elegant
>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
>> 1 will go slower.
> 
> Do you have some numbers?

Raw number from vmexit.flat on Haswell-EP:

mmio-no-eventfd:pci-mem 5793
mmio-wildcard-eventfd:pci-mem 1395
mmio-datamatch-eventfd:pci-mem 2268

So roughly 900 clock cycles.  Most of the work is the four memory reads
done by x86_decode_insn, three to walk the page tables and one to fetch
the instruction.

> We could keep the ugliness in KVM and add a new skip function with
> emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the
> instruction.  (Adding a condition just for EPT violation exit reason to
> the existing skip function would be a dirtier solution.)
> Slower than what we have now, but faster than full emulation.

This is actually a good idea, and not ugly at all!  The main cost is
translating the physical address of the instruction and fetching the
bytes, so only 200 clock cycles are saved.

However, the eventfd is written before decoding, while full emulation
would write it after. So while VCPU thread latency is worse compared to
skip_emulated_instruction, latency to the iothread remains small.

Paolo

Michael S. Tsirkin Aug. 16, 2017, 2:03 p.m. UTC | #7

On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 15:16, Michael S. Tsirkin wrote:
> > On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote:
> >> On 16/08/2017 14:58, Michael S. Tsirkin wrote:
> >>> On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote:
> >>>> Microsoft pointed out privately to me that KVM's handling of
> >>>> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
> >>>> in EPT misconfiguration vmexit handlers, because neither EPT violations
> >>>> nor misconfigurations are listed in the manual among the VM exits that
> >>>> set the VM-exit instruction length field.
> >>>>
> >>>> While physical processors seem to set the field, this is not architectural
> >>>> and is just a side effect of the implementation.  I couldn't convince
> >>>> myself of any condition on the exit qualification where VM-exit
> >>>> instruction length "has" to be defined; there are no trap-like VM-exits
> >>>> that can be repurposed; and fault-like VM-exits such as descriptor-table
> >>>> exits provide no decoding information.  So I don't really see any elegant
> >>>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> >>>> 1 will go slower.
> >>>
> >>> How about I will try asking Intel about it? If they can commit to length
> >>> being there in the future, we are all set.
> >>
> >> Nope, "I couldn't convince myself of any condition on the exit
> >> qualification where VM-exit instruction length "has" to be defined".  So
> >> assuming Intel can do it, it would only apply to future processors (2
> >> years+ for server SKUs).
> > 
> > Well maybe there's a reason it's actually working.  Let's see what can
> > be done.
> 
> Sure there is.  It just happens that the actual condition for VM-exit
> instruction length being set correctly is "the fault was taken after the
> accessing instruction has been decoded".  But there's no way according
> to the spec to detect whether that has happened.
> 
> While you can filter out instruction fetches, that's not enough.  A data
> read could happen because someone pointed the IDT to MMIO area, and who
> knows what the VM-exit instruction length points to in that case.

Interesting. We use it with MMIO so this kind of hack is not supposed to
work anyway. Does not seem to be worth slowing virtio for.


> >> Plus of course it wouldn't be guaranteed to work on nested.
> > 
> > Not sure I got this one.
> 
> Not all nested hypervisors are setting the VM-exit instruction length
> field on EPT violations, since it's documented not to be set.

I see. We could special case this by testing the hypervisor leaf
if we wanted to.

> >>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical
> >>>> address would do it, but it adds hypervisor knowledge in virtio, including
> >>>> CPUID handling.
> >>>
> >>> Another issue is that it will break DPDK on virtio.
> >>
> >> Not break, just make it slower.
> > 
> > I thought hypercalls can only be triggered from ring 0, userspace can't call them.
> > Dod I get it wrong?
> 
> That's just a limitation that KVM makes on currently-defined hypercalls.
> 
> VMCALL causes a vmexit if executed from ring 3.
> 
> Paolo

OK but we will have to build our own protection mechanism then. With
MMIO we just used pagetables.

Michael S. Tsirkin Aug. 16, 2017, 2:06 p.m. UTC | #8

On Wed, Aug 16, 2017 at 03:37:47PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 14:07, Radim Krčmář wrote:
> > 2017-08-16 13:22+0200, Paolo Bonzini:
> >> Microsoft pointed out privately to me that KVM's handling of
> >> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
> >> in EPT misconfiguration vmexit handlers, because neither EPT violations
> >> nor misconfigurations are listed in the manual among the VM exits that
> >> set the VM-exit instruction length field.
> >>
> >> While physical processors seem to set the field, this is not architectural
> >> and is just a side effect of the implementation.  I couldn't convince
> >> myself of any condition on the exit qualification where VM-exit
> >> instruction length "has" to be defined; there are no trap-like VM-exits
> >> that can be repurposed; and fault-like VM-exits such as descriptor-table
> >> exits provide no decoding information.  So I don't really see any elegant
> >> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> >> 1 will go slower.
> > 
> > Do you have some numbers?
> 
> Raw number from vmexit.flat on Haswell-EP:
> 
> mmio-no-eventfd:pci-mem 5793
> mmio-wildcard-eventfd:pci-mem 1395
> mmio-datamatch-eventfd:pci-mem 2268
> 
> So roughly 900 clock cycles.  Most of the work is the four memory reads
> done by x86_decode_insn, three to walk the page tables and one to fetch
> the instruction.
> 
> > We could keep the ugliness in KVM and add a new skip function with
> > emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the
> > instruction.  (Adding a condition just for EPT violation exit reason to
> > the existing skip function would be a dirtier solution.)
> > Slower than what we have now, but faster than full emulation.
> 
> This is actually a good idea, and not ugly at all!  The main cost is
> translating the physical address of the instruction and fetching the
> bytes, so only 200 clock cycles are saved.

We actually know what to expect (a write) so we could maybe
optimize this some more with a dedicated function just for this.

> 
> However, the eventfd is written before decoding, while full emulation
> would write it after. So while VCPU thread latency is worse compared to
> skip_emulated_instruction, latency to the iothread remains small.
> 
> Paolo

Paolo Bonzini Aug. 16, 2017, 2:17 p.m. UTC | #9

> We actually know what to expect (a write) so we could maybe
> optimize this some more with a dedicated function just for this.

We don't know the addressing mode, the size or the source (immediate
vs. register), so no.

KVM is already doing a single translation and read no matter how long
the instruction (unless it cross page boundaries).

Paolo

> > 
> > However, the eventfd is written before decoding, while full emulation
> > would write it after. So while VCPU thread latency is worse compared to
> > skip_emulated_instruction, latency to the iothread remains small.
> > 
> > Paolo
>

Michael S. Tsirkin Aug. 16, 2017, 4:50 p.m. UTC | #10

On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 15:16, Michael S. Tsirkin wrote:
> > On Wed, Aug 16, 2017 at 03:05:51PM +0200, Paolo Bonzini wrote:
> >> On 16/08/2017 14:58, Michael S. Tsirkin wrote:
> >>> On Wed, Aug 16, 2017 at 01:22:49PM +0200, Paolo Bonzini wrote:
> >>>> Microsoft pointed out privately to me that KVM's handling of
> >>>> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
> >>>> in EPT misconfiguration vmexit handlers, because neither EPT violations
> >>>> nor misconfigurations are listed in the manual among the VM exits that
> >>>> set the VM-exit instruction length field.
> >>>>
> >>>> While physical processors seem to set the field, this is not architectural
> >>>> and is just a side effect of the implementation.  I couldn't convince
> >>>> myself of any condition on the exit qualification where VM-exit
> >>>> instruction length "has" to be defined; there are no trap-like VM-exits
> >>>> that can be repurposed; and fault-like VM-exits such as descriptor-table
> >>>> exits provide no decoding information.  So I don't really see any elegant
> >>>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
> >>>> 1 will go slower.
> >>>
> >>> How about I will try asking Intel about it? If they can commit to length
> >>> being there in the future, we are all set.
> >>
> >> Nope, "I couldn't convince myself of any condition on the exit
> >> qualification where VM-exit instruction length "has" to be defined".  So
> >> assuming Intel can do it, it would only apply to future processors (2
> >> years+ for server SKUs).
> > 
> > Well maybe there's a reason it's actually working.  Let's see what can
> > be done.
> 
> Sure there is.  It just happens that the actual condition for VM-exit
> instruction length being set correctly is "the fault was taken after the
> accessing instruction has been decoded".  But there's no way according
> to the spec to detect whether that has happened.
> 
> While you can filter out instruction fetches, that's not enough.  A data
> read could happen because someone pointed the IDT to MMIO area, and who
> knows what the VM-exit instruction length points to in that case.

Thinking more about it, I don't really see how anything
legal guest might be doing with virtio would trigger anything
but a fault after decoding the instruction. How does
skipping instruction even make sense in the example you give?


Please note that
1. we have verified the address and we know it's the one
   that belongs to MMIO of a virtio device
2. only write accesses are allowed by the eventfd or virtio spec
   guests doing anything else are on their own.

Are there any reasonable examples on real hardware? Could you share this pls?


> >> Plus of course it wouldn't be guaranteed to work on nested.
> > 
> > Not sure I got this one.
> 
> Not all nested hypervisors are setting the VM-exit instruction length
> field on EPT violations, since it's documented not to be set.

So that's probably the real issue - nested virt which has to do it
in software at extra cost. We already limit this to intel processors,
how about we blacklist nested virt for this optimization?

I agree it's skating it a bit close to the dangerous edge,
but so are other tricks we play with PTEs to speed up MMIO.

> >>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical
> >>>> address would do it, but it adds hypervisor knowledge in virtio, including
> >>>> CPUID handling.
> >>>
> >>> Another issue is that it will break DPDK on virtio.
> >>
> >> Not break, just make it slower.
> > 
> > I thought hypercalls can only be triggered from ring 0, userspace can't call them.
> > Dod I get it wrong?
> 
> That's just a limitation that KVM makes on currently-defined hypercalls.
> 
> VMCALL causes a vmexit if executed from ring 3.
> 
> Paolo

Paolo Bonzini Aug. 16, 2017, 5:19 p.m. UTC | #11

On 16/08/2017 18:50, Michael S. Tsirkin wrote:
> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
>> While you can filter out instruction fetches, that's not enough.  A data
>> read could happen because someone pointed the IDT to MMIO area, and who
>> knows what the VM-exit instruction length points to in that case.
> 
> Thinking more about it, I don't really see how anything
> legal guest might be doing with virtio would trigger anything
> but a fault after decoding the instruction. How does
> skipping instruction even make sense in the example you give?

There's no such thing as a legal guest.  Anything that the hypervisor
does, that differs from real hardware, is a possible escalation path.

This in fact makes me doubt the EMULTYPE_SKIP patch too.

>>>> Plus of course it wouldn't be guaranteed to work on nested.
>>>
>>> Not sure I got this one.
>>
>> Not all nested hypervisors are setting the VM-exit instruction length
>> field on EPT violations, since it's documented not to be set.
> 
> So that's probably the real issue - nested virt which has to do it
> in software at extra cost. We already limit this to intel processors,
> how about we blacklist nested virt for this optimization?
> 
> I agree it's skating it a bit close to the dangerous edge,
> but so are other tricks we play with PTEs to speed up MMIO.

Not at all.  Everything else we do is perfectly fine according to the
spec, this one isn't.

Paolo

>>>>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical
>>>>>> address would do it, but it adds hypervisor knowledge in virtio, including
>>>>>> CPUID handling.
>>>>>
>>>>> Another issue is that it will break DPDK on virtio.
>>>>
>>>> Not break, just make it slower.
>>>
>>> I thought hypercalls can only be triggered from ring 0, userspace can't call them.
>>> Dod I get it wrong?
>>
>> That's just a limitation that KVM makes on currently-defined hypercalls.
>>
>> VMCALL causes a vmexit if executed from ring 3.
>>
>> Paolo
>

Radim Krčmář Aug. 16, 2017, 7:03 p.m. UTC | #12

2017-08-16 19:19+0200, Paolo Bonzini:
> On 16/08/2017 18:50, Michael S. Tsirkin wrote:
>> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
>>> While you can filter out instruction fetches, that's not enough.  A data
>>> read could happen because someone pointed the IDT to MMIO area, and who
>>> knows what the VM-exit instruction length points to in that case.
>> 
>> Thinking more about it, I don't really see how anything
>> legal guest might be doing with virtio would trigger anything
>> but a fault after decoding the instruction. How does
>> skipping instruction even make sense in the example you give?
> 
> There's no such thing as a legal guest.  Anything that the hypervisor
> does, that differs from real hardware, is a possible escalation path.
> 
> This in fact makes me doubt the EMULTYPE_SKIP patch too.

The main hack is that we expect EPT misconfig within a given range to be
a MMIO NULL write.  I think it is fine -- EMULTYPE_SKIP is a common path
that should have well tested error paths and, IIUC, virtio doesn't allow
any other access, so it is a problem of the guest if a buggy/malicious
application can access virtio memory.

>>>>> Plus of course it wouldn't be guaranteed to work on nested.
>>>>
>>>> Not sure I got this one.
>>>
>>> Not all nested hypervisors are setting the VM-exit instruction length
>>> field on EPT violations, since it's documented not to be set.
>> 
>> So that's probably the real issue - nested virt which has to do it
>> in software at extra cost. We already limit this to intel processors,

Hm, there is no reason to exclude SVM.

>> how about we blacklist nested virt for this optimization?

Not every hypervisor can be easily detected ... KVM uses standard
features and SDM clearly says that the instruction length field is
undefined.

We only lose performance if we decode the instruction, but piling
workarounds creates unexpected corner cases.

I still don't see acceptable alternatives to Paolo's solution.

Michael S. Tsirkin Aug. 16, 2017, 7:47 p.m. UTC | #13

On Wed, Aug 16, 2017 at 07:19:28PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 18:50, Michael S. Tsirkin wrote:
> > On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
> >> While you can filter out instruction fetches, that's not enough.  A data
> >> read could happen because someone pointed the IDT to MMIO area, and who
> >> knows what the VM-exit instruction length points to in that case.
> > 
> > Thinking more about it, I don't really see how anything
> > legal guest might be doing with virtio would trigger anything
> > but a fault after decoding the instruction. How does
> > skipping instruction even make sense in the example you give?
> 
> There's no such thing as a legal guest.  Anything that the hypervisor
> does, that differs from real hardware, is a possible escalation path.

Fast MMIO bus devices don't apprear out of thin air.

They appear because guest enabled a virtio device.

So it is a PV guest and if it doesn't behave according to the virtio
spec, it is going to crash.


> 
> This in fact makes me doubt the EMULTYPE_SKIP patch too.
> 
> >>>> Plus of course it wouldn't be guaranteed to work on nested.
> >>>
> >>> Not sure I got this one.
> >>
> >> Not all nested hypervisors are setting the VM-exit instruction length
> >> field on EPT violations, since it's documented not to be set.
> > 
> > So that's probably the real issue - nested virt which has to do it
> > in software at extra cost. We already limit this to intel processors,
> > how about we blacklist nested virt for this optimization?
> > 
> > I agree it's skating it a bit close to the dangerous edge,
> > but so are other tricks we play with PTEs to speed up MMIO.
> 
> Not at all.  Everything else we do is perfectly fine according to the
> spec, this one isn't.
> 
> Paolo

Virtio MMIO is kind of special in many ways.

What happens if I map and try to execute an MMIO BAR? I don't think it
will work, will it?


> >>>>>> Adding a hypercall or MSR write that does a fast MMIO write to a physical
> >>>>>> address would do it, but it adds hypervisor knowledge in virtio, including
> >>>>>> CPUID handling.
> >>>>>
> >>>>> Another issue is that it will break DPDK on virtio.
> >>>>
> >>>> Not break, just make it slower.
> >>>
> >>> I thought hypercalls can only be triggered from ring 0, userspace can't call them.
> >>> Dod I get it wrong?
> >>
> >> That's just a limitation that KVM makes on currently-defined hypercalls.
> >>
> >> VMCALL causes a vmexit if executed from ring 3.
> >>
> >> Paolo
> >

Michael S. Tsirkin Aug. 16, 2017, 7:59 p.m. UTC | #14

On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote:
> 2017-08-16 19:19+0200, Paolo Bonzini:
> > On 16/08/2017 18:50, Michael S. Tsirkin wrote:
> >> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
> >>> While you can filter out instruction fetches, that's not enough.  A data
> >>> read could happen because someone pointed the IDT to MMIO area, and who
> >>> knows what the VM-exit instruction length points to in that case.
> >> 
> >> Thinking more about it, I don't really see how anything
> >> legal guest might be doing with virtio would trigger anything
> >> but a fault after decoding the instruction. How does
> >> skipping instruction even make sense in the example you give?
> > 
> > There's no such thing as a legal guest.  Anything that the hypervisor
> > does, that differs from real hardware, is a possible escalation path.
> > 
> > This in fact makes me doubt the EMULTYPE_SKIP patch too.
> 
> The main hack is that we expect EPT misconfig within a given range to be
> a MMIO NULL write.  I think it is fine -- EMULTYPE_SKIP is a common path
> that should have well tested error paths and, IIUC, virtio doesn't allow
> any other access, so it is a problem of the guest if a buggy/malicious
> application can access virtio memory.
> 
> >>>>> Plus of course it wouldn't be guaranteed to work on nested.
> >>>>
> >>>> Not sure I got this one.
> >>>
> >>> Not all nested hypervisors are setting the VM-exit instruction length
> >>> field on EPT violations, since it's documented not to be set.
> >> 
> >> So that's probably the real issue - nested virt which has to do it
> >> in software at extra cost. We already limit this to intel processors,
> 
> Hm, there is no reason to exclude SVM.
> 
> >> how about we blacklist nested virt for this optimization?
> 
> Not every hypervisor can be easily detected ...

Hypervisors that don't set a hypervisor bit in CPUID are violating the
spec themselves, aren't they?  Anyway, we can add a management option
for use in a nested scenario.

> KVM uses standard
> features and SDM clearly says that the instruction length field is
> undefined.

True. Let's see whether intel can commit to a stronger definition.
I don't think there's any rush to make this change.


> We only lose performance if we decode the instruction, but piling
> workarounds creates unexpected corner cases.
> 
> I still don't see acceptable alternatives to Paolo's solution.

It's just that this has been there for 3 years and people have built a
product around this. It's not a feature you can discard out of hand out
of theoretical concerns or to improve niche use-cases such as nested
virt.

Paolo Bonzini Aug. 16, 2017, 9:25 p.m. UTC | #15

On 16/08/2017 21:59, Michael S. Tsirkin wrote:
> On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote:
>> 2017-08-16 19:19+0200, Paolo Bonzini:
>>> On 16/08/2017 18:50, Michael S. Tsirkin wrote:
>>>> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
>>>>> While you can filter out instruction fetches, that's not enough.  A data
>>>>> read could happen because someone pointed the IDT to MMIO area, and who
>>>>> knows what the VM-exit instruction length points to in that case.
>>>>
>>>> Thinking more about it, I don't really see how anything
>>>> legal guest might be doing with virtio would trigger anything
>>>> but a fault after decoding the instruction. How does
>>>> skipping instruction even make sense in the example you give?
>>>
>>> There's no such thing as a legal guest.  Anything that the hypervisor
>>> does, that differs from real hardware, is a possible escalation path.
>>>
>>> This in fact makes me doubt the EMULTYPE_SKIP patch too.
>>
>> The main hack is that we expect EPT misconfig within a given range to be
>> a MMIO NULL write.  I think it is fine -- EMULTYPE_SKIP is a common path
>> that should have well tested error paths and, IIUC, virtio doesn't allow
>> any other access, so it is a problem of the guest if a buggy/malicious
>> application can access virtio memory.

Yes, I agree.  EMULTYPE_SKIP is fine because failed decoding still
causes an exception to be injected.  Maybe it's better to gate the
EMULTYPE_SKIP emulation on the exit qualification saying this is a write
and also not a page table walk---just in case.

>>>> how about we blacklist nested virt for this optimization?
>>
>> Not every hypervisor can be easily detected ...
> 
> Hypervisors that don't set a hypervisor bit in CPUID are violating the
> spec themselves, aren't they?  Anyway, we can add a management option
> for use in a nested scenario.

No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined.
See for example
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458:
"Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF
for software use. Hypervisors can use these leaves to provide an
interface to pass information from the hypervisor to the guest operating
system running inside a virtual machine. The hypervisor bit indicates
the presence of a hypervisor and that it is safe to test these
additional software leaves".

>> KVM uses standard features and SDM clearly says that the
>> instruction length field is undefined.
> 
> True. Let's see whether intel can commit to a stronger definition.
> I don't think there's any rush to make this change.

I disagree.  Relying on undefined processor features is a bad idea.

> It's just that this has been there for 3 years and people have built a
> product around this.

Around 700 clock cycles?

Paolo

Michael S. Tsirkin Aug. 16, 2017, 10:31 p.m. UTC | #16

On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote:
> On 16/08/2017 21:59, Michael S. Tsirkin wrote:
> > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote:
> >> 2017-08-16 19:19+0200, Paolo Bonzini:
> >>> On 16/08/2017 18:50, Michael S. Tsirkin wrote:
> >>>> On Wed, Aug 16, 2017 at 03:30:31PM +0200, Paolo Bonzini wrote:
> >>>>> While you can filter out instruction fetches, that's not enough.  A data
> >>>>> read could happen because someone pointed the IDT to MMIO area, and who
> >>>>> knows what the VM-exit instruction length points to in that case.
> >>>>
> >>>> Thinking more about it, I don't really see how anything
> >>>> legal guest might be doing with virtio would trigger anything
> >>>> but a fault after decoding the instruction. How does
> >>>> skipping instruction even make sense in the example you give?
> >>>
> >>> There's no such thing as a legal guest.  Anything that the hypervisor
> >>> does, that differs from real hardware, is a possible escalation path.
> >>>
> >>> This in fact makes me doubt the EMULTYPE_SKIP patch too.
> >>
> >> The main hack is that we expect EPT misconfig within a given range to be
> >> a MMIO NULL write.  I think it is fine -- EMULTYPE_SKIP is a common path
> >> that should have well tested error paths and, IIUC, virtio doesn't allow
> >> any other access, so it is a problem of the guest if a buggy/malicious
> >> application can access virtio memory.
> 
> Yes, I agree.  EMULTYPE_SKIP is fine because failed decoding still
> causes an exception to be injected.  Maybe it's better to gate the
> EMULTYPE_SKIP emulation on the exit qualification saying this is a write

I thought it's already limited to writes. I agree that's a reasonable
limitation in any case.

> and also not a page table walk---just in case.

I still don't get it, sorry. Let's assume for the sake of argument
that it's a PT walk causing the MMIO access. Just why do you think
that it makes sense to skip the instruction that caused the walk?

> >>>> how about we blacklist nested virt for this optimization?
> >>
> >> Not every hypervisor can be easily detected ...
> > 
> > Hypervisors that don't set a hypervisor bit in CPUID are violating the
> > spec themselves, aren't they?  Anyway, we can add a management option
> > for use in a nested scenario.
> 
> No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined.
> See for example
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458:
> "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF
> for software use. Hypervisors can use these leaves to provide an
> interface to pass information from the hypervisor to the guest operating
> system running inside a virtual machine. The hypervisor bit indicates
> the presence of a hypervisor and that it is safe to test these
> additional software leaves".

Looks like it's not a bug then. Still, most hypervisors do have this
leaf so it's a reasonable way that will catch most issues.  We can
always blacklist more as they are found. Additionally let's go ahead
and add ability for userspace to disable fast MMIO for these
hypervisors we failed to detect.

> >> KVM uses standard features and SDM clearly says that the
> >> instruction length field is undefined.
> > 
> > True. Let's see whether intel can commit to a stronger definition.
> > I don't think there's any rush to make this change.
> 
> I disagree.  Relying on undefined processor features is a bad idea.

Maybe it was a bad idea 3 years ago, yes. In 2012 I posted "kvm_para:
add mmio word store hypercall" as an alternative.  Was nacked as MMIO
was seen as safer and better. By now many people rely on mmio being
fast.  Let's talk to hardware guys to define the feature before we give
up and spend years designing an alternative.

> > It's just that this has been there for 3 years and people have built a
> > product around this.
> 
> Around 700 clock cycles?
> 
> Paolo

About 30% the cost of exit, isn't it?  There are definitely workloads
where cost of exit gates performance. We didn't work on fast mmio based
on theoretical assumptions. But maybe I am wrong. We'll see. Jason here
volunteered to test your patch and we'll see what comes out of it. If
I'm wrong and it's about 1%, I won't split hairs.

David Hildenbrand Aug. 17, 2017, 8:15 a.m. UTC | #17

On 16.08.2017 14:07, Radim Krčmář wrote:
> 2017-08-16 13:22+0200, Paolo Bonzini:
>> Microsoft pointed out privately to me that KVM's handling of
>> KVM_FAST_MMIO_BUS is invalid.  Using skip_emulation_instruction is invalid
>> in EPT misconfiguration vmexit handlers, because neither EPT violations
>> nor misconfigurations are listed in the manual among the VM exits that
>> set the VM-exit instruction length field.
>>
>> While physical processors seem to set the field, this is not architectural
>> and is just a side effect of the implementation.  I couldn't convince
>> myself of any condition on the exit qualification where VM-exit
>> instruction length "has" to be defined; there are no trap-like VM-exits
>> that can be repurposed; and fault-like VM-exits such as descriptor-table
>> exits provide no decoding information.  So I don't really see any elegant
>> way to fix it except by disabling KVM_FAST_MMIO_BUS, which means virtio
>> 1 will go slower.
> 
> Do you have some numbers?
> 
> We could keep the ugliness in KVM and add a new skip function with
> emulate_instruction(vcpu, EMULTYPE_SKIP) to decode the length of the
> instruction.  (Adding a condition just for EPT violation exit reason to

I like that idea.

> the existing skip function would be a dirtier solution.)
> Slower than what we have now, but faster than full emulation.
> 
> I agree that configuring EPT to throw a violation when accessing fast
> MMIO has many drawbacks (although it seems to be what Intel expected).
> 
> Thanks.

Paolo Bonzini Aug. 17, 2017, 9 a.m. UTC | #18

On 17/08/2017 00:31, Michael S. Tsirkin wrote:
> On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote:
>> Yes, I agree.  EMULTYPE_SKIP is fine because failed decoding still
>> causes an exception to be injected.  Maybe it's better to gate the
>> EMULTYPE_SKIP emulation on the exit qualification saying this is a write
> 
> I thought it's already limited to writes. I agree that's a reasonable
> limitation in any case.
> 
>> and also not a page table walk---just in case.
> 
> I still don't get it, sorry. Let's assume for the sake of argument
> that it's a PT walk causing the MMIO access. Just why do you think
> that it makes sense to skip the instruction that caused the walk?

I think it doesn't.  I think in that case it's better to skip the fast
write and proceed with full emulation.

>>> It's just that this has been there for 3 years and people have built a
>>> product around this.
>>
>> Around 700 clock cycles?
> 
> About 30% the cost of exit, isn't it?  There are definitely workloads
> where cost of exit gates performance. We didn't work on fast mmio based
> on theoretical assumptions. But maybe I am wrong. We'll see. Jason here
> volunteered to test your patch and we'll see what comes out of it. If
> I'm wrong and it's about 1%, I won't split hairs.

Note that we still get the latency benefit from fast MMIO, and maybe we
can cut a couple hundred clock cycles more---which would benefit all
emulation, not just fast MMIO.

Paolo

Paolo Bonzini Aug. 17, 2017, 12:14 p.m. UTC | #19

On 17/08/2017 11:00, Paolo Bonzini wrote:
> On 17/08/2017 00:31, Michael S. Tsirkin wrote:
>> On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote:
>>> Yes, I agree.  EMULTYPE_SKIP is fine because failed decoding still
>>> causes an exception to be injected.  Maybe it's better to gate the
>>> EMULTYPE_SKIP emulation on the exit qualification saying this is a write
>>
>> I thought it's already limited to writes. I agree that's a reasonable
>> limitation in any case.
>>
>>> and also not a page table walk---just in case.
>>
>> I still don't get it, sorry. Let's assume for the sake of argument
>> that it's a PT walk causing the MMIO access. Just why do you think
>> that it makes sense to skip the instruction that caused the walk?
> 
> I think it doesn't.  I think in that case it's better to skip the fast
> write and proceed with full emulation.

... nope, exit qualification is just zero for EPT misconfigurations, so
we cannot do this.

Paolo

Radim Krčmář Aug. 17, 2017, 1:23 p.m. UTC | #20

2017-08-17 14:14+0200, Paolo Bonzini:
> On 17/08/2017 11:00, Paolo Bonzini wrote:
> > On 17/08/2017 00:31, Michael S. Tsirkin wrote:
> >> On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote:
> >>> Yes, I agree.  EMULTYPE_SKIP is fine because failed decoding still
> >>> causes an exception to be injected.  Maybe it's better to gate the
> >>> EMULTYPE_SKIP emulation on the exit qualification saying this is a write
> >>
> >> I thought it's already limited to writes. I agree that's a reasonable
> >> limitation in any case.
> >>
> >>> and also not a page table walk---just in case.
> >>
> >> I still don't get it, sorry. Let's assume for the sake of argument
> >> that it's a PT walk causing the MMIO access. Just why do you think
> >> that it makes sense to skip the instruction that caused the walk?
> > 
> > I think it doesn't.  I think in that case it's better to skip the fast
> > write and proceed with full emulation.

It definitely doesn't.  I think it's possible only because virtio
forbids other access types, so we assume that every access is a write
access.

> ... nope, exit qualification is just zero for EPT misconfigurations, so
> we cannot do this.

Yeah, we'd need to use EPT violation to get that information.

Radim Krčmář Aug. 17, 2017, 1:51 p.m. UTC | #21

2017-08-17 01:31+0300, Michael S. Tsirkin:
> On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote:
> > On 16/08/2017 21:59, Michael S. Tsirkin wrote:
> > > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote:
> > >>>> how about we blacklist nested virt for this optimization?
> > >>
> > >> Not every hypervisor can be easily detected ...
> > > 
> > > Hypervisors that don't set a hypervisor bit in CPUID are violating the
> > > spec themselves, aren't they?  Anyway, we can add a management option
> > > for use in a nested scenario.
> > 
> > No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined.
> > See for example
> > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458:
> > "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF
> > for software use. Hypervisors can use these leaves to provide an
> > interface to pass information from the hypervisor to the guest operating
> > system running inside a virtual machine. The hypervisor bit indicates
> > the presence of a hypervisor and that it is safe to test these
> > additional software leaves".
> 
> Looks like it's not a bug then. Still, most hypervisors do have this
> leaf so it's a reasonable way that will catch most issues.  We can
> always blacklist more as they are found. Additionally let's go ahead
> and add ability for userspace to disable fast MMIO for these
> hypervisors we failed to detect.

In the worst case, I'd make faster mmio an opt-in unsafe feature
regardless of what we run on.  Users that just want KVM to work get the
default and people who care about utmost performance can jump through
loops.

> > >> KVM uses standard features and SDM clearly says that the
> > >> instruction length field is undefined.
> > > 
> > > True. Let's see whether intel can commit to a stronger definition.
> > > I don't think there's any rush to make this change.
> > 
> > I disagree.  Relying on undefined processor features is a bad idea.
> 
> Maybe it was a bad idea 3 years ago, yes. In 2012 I posted "kvm_para:
> add mmio word store hypercall" as an alternative.  Was nacked as MMIO
> was seen as safer and better. By now many people rely on mmio being
> fast.  Let's talk to hardware guys to define the feature before we give
> up and spend years designing an alternative.

The change is not backward-compatible wrt. SDM, but all processors might
actually be behaving like we want ...  (I'd assert undefined behavior
add a vm-exit flag if I were to allow it, though.)

> > > It's just that this has been there for 3 years and people have built a
> > > product around this.
> > 
> > Around 700 clock cycles?
> > 
> > Paolo
> 
> About 30% the cost of exit, isn't it?  There are definitely workloads
> where cost of exit gates performance. We didn't work on fast mmio based
> on theoretical assumptions. But maybe I am wrong. We'll see. Jason here
> volunteered to test your patch and we'll see what comes out of it. If
> I'm wrong and it's about 1%, I won't split hairs.

I'm ok with waiting for the numbers as I hope that we won't have to
resort to adding special cases.

Michael S. Tsirkin Aug. 17, 2017, 3:15 p.m. UTC | #22

On Thu, Aug 17, 2017 at 11:00:01AM +0200, Paolo Bonzini wrote:
> >> and also not a page table walk---just in case.
> > 
> > I still don't get it, sorry. Let's assume for the sake of argument
> > that it's a PT walk causing the MMIO access. Just why do you think
> > that it makes sense to skip the instruction that caused the walk?
> 
> I think it doesn't.  I think in that case it's better to skip the fast
> write and proceed with full emulation.

Right. So my argument is that fast mmio is used for paravirtualization
exclusively. Thus someone doing anything that isn't a memory write to
trigger it gets to keep both pieces.

I get it that even writes don't work in some nested virt setups, and
this seems worth fixing, and I get it that we have inadvertently relied
on an undocumented behaviour.

I would like to understand whether you believe that on bare metal and
when accessing MMIO using writes exclusively, and after checking the
address matches one of the paravirtualized device, there's still a real
chance length is invalid, or is it more a question of missing
documentation.

Michael S. Tsirkin Aug. 17, 2017, 3:27 p.m. UTC | #23

On Thu, Aug 17, 2017 at 03:51:31PM +0200, Radim Krčmář wrote:
> 2017-08-17 01:31+0300, Michael S. Tsirkin:
> > On Wed, Aug 16, 2017 at 11:25:35PM +0200, Paolo Bonzini wrote:
> > > On 16/08/2017 21:59, Michael S. Tsirkin wrote:
> > > > On Wed, Aug 16, 2017 at 09:03:17PM +0200, Radim Krčmář wrote:
> > > >>>> how about we blacklist nested virt for this optimization?
> > > >>
> > > >> Not every hypervisor can be easily detected ...
> > > > 
> > > > Hypervisors that don't set a hypervisor bit in CPUID are violating the
> > > > spec themselves, aren't they?  Anyway, we can add a management option
> > > > for use in a nested scenario.
> > > 
> > > No, the hypervisor bit only says that CPUID leaf 0x40000000 is defined.
> > > See for example
> > > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009458:
> > > "Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF
> > > for software use. Hypervisors can use these leaves to provide an
> > > interface to pass information from the hypervisor to the guest operating
> > > system running inside a virtual machine. The hypervisor bit indicates
> > > the presence of a hypervisor and that it is safe to test these
> > > additional software leaves".
> > 
> > Looks like it's not a bug then. Still, most hypervisors do have this
> > leaf so it's a reasonable way that will catch most issues.  We can
> > always blacklist more as they are found. Additionally let's go ahead
> > and add ability for userspace to disable fast MMIO for these
> > hypervisors we failed to detect.
> 
> In the worst case, I'd make faster mmio an opt-in unsafe feature
> regardless of what we run on.  Users that just want KVM to work get the
> default and people who care about utmost performance can jump through
> loops.

Might have been reasonable originally but you do not want to slow
down everyone and then make them jump through hoops just to get
back where they were originally.

So I think it has to be opt-out, really for nested setups
at this point.

> > > >> KVM uses standard features and SDM clearly says that the
> > > >> instruction length field is undefined.
> > > > 
> > > > True. Let's see whether intel can commit to a stronger definition.
> > > > I don't think there's any rush to make this change.
> > > 
> > > I disagree.  Relying on undefined processor features is a bad idea.
> > 
> > Maybe it was a bad idea 3 years ago, yes. In 2012 I posted "kvm_para:
> > add mmio word store hypercall" as an alternative.  Was nacked as MMIO
> > was seen as safer and better. By now many people rely on mmio being
> > fast.  Let's talk to hardware guys to define the feature before we give
> > up and spend years designing an alternative.
> 
> The change is not backward-compatible wrt. SDM, but all processors might
> actually be behaving like we want ...  (I'd assert undefined behavior
> add a vm-exit flag if I were to allow it, though.)
> 
> > > > It's just that this has been there for 3 years and people have built a
> > > > product around this.
> > > 
> > > Around 700 clock cycles?
> > > 
> > > Paolo
> > 
> > About 30% the cost of exit, isn't it?  There are definitely workloads
> > where cost of exit gates performance. We didn't work on fast mmio based
> > on theoretical assumptions. But maybe I am wrong. We'll see. Jason here
> > volunteered to test your patch and we'll see what comes out of it. If
> > I'm wrong and it's about 1%, I won't split hairs.
> 
> I'm ok with waiting for the numbers as I hope that we won't have to
> resort to adding special cases.

kvm: x86: disable KVM_FAST_MMIO_BUS

Commit Message

Comments

Patch