
[1/2] KVM: X86: Single target IPI fastpath

Message ID 1573283135-5502-1-git-send-email-wanpengli@tencent.com (mailing list archive)
State New, archived
Series [1/2] KVM: X86: Single target IPI fastpath

Commit Message

Wanpeng Li Nov. 9, 2019, 7:05 a.m. UTC
From: Wanpeng Li <wanpengli@tencent.com>

This patch optimizes the single-target IPI path for x2apic physical destination
mode with fixed delivery mode, by delivering the IPI to the receiver immediately
in the sender's ICR-write vmexit, avoiding various checks when possible.
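
For reference only (not part of this patch; the macro and function names below
are illustrative): in x2apic mode a single-target fixed IPI is just one wrmsr
to the ICR from the guest, and that write is what the fastpath intercepts on
the host side. A minimal guest-kernel-side sketch:

    #include <stdint.h>

    #define X2APIC_ICR_MSR      0x830       /* APIC_BASE_MSR (0x800) + (APIC_ICR >> 4) */
    #define APIC_DM_FIXED       0x00000000  /* delivery mode: fixed */
    #define APIC_DEST_PHYSICAL  0x00000000  /* bit 11 clear: physical destination */

    /* Sketch: send 'vector' to the CPU whose x2apic id is 'apicid'. */
    static inline void x2apic_send_single_ipi(uint32_t apicid, uint8_t vector)
    {
            /* the x2apic ICR is one 64-bit MSR; the destination lives in bits 63:32 */
            uint64_t icr = ((uint64_t)apicid << 32) | APIC_DEST_PHYSICAL |
                           APIC_DM_FIXED | vector;

            asm volatile("wrmsr" : : "c"(X2APIC_ICR_MSR),
                         "a"((uint32_t)icr), "d"((uint32_t)(icr >> 32)) : "memory");
    }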

Testing on Xeon Skylake server:

The virtual IPI latency, from the sender issuing the IPI to the receiver
receiving it, is reduced by more than 330 cpu cycles.

Running hackbench (reschedule IPIs) in the guest, the average handling time of
MSR_WRITE-caused vmexits is reduced by more than 1000 cpu cycles:

Before patch:

  VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
MSR_WRITE    5417390    90.01%    16.31%      0.69us    159.60us    1.08us

After patch:

  VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
MSR_WRITE    6726109    90.73%    62.18%      0.48us    191.27us    0.58us

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/vmx/vmx.c   | 39 +++++++++++++++++++++++++++++++++++++--
 include/linux/kvm_host.h |  1 +
 2 files changed, 38 insertions(+), 2 deletions(-)

Comments

Vitaly Kuznetsov Nov. 11, 2019, 1:06 p.m. UTC | #1
Wanpeng Li <kernellwp@gmail.com> writes:

> +
>  static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -6615,6 +6645,12 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  				  | (1 << VCPU_EXREG_CR3));
>  	vcpu->arch.regs_dirty = 0;
>  
> +	vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
> +	vcpu->fast_vmexit = false;
> +	if (!is_guest_mode(vcpu) &&
> +		vmx->exit_reason == EXIT_REASON_MSR_WRITE)
> +		vcpu->fast_vmexit = handle_ipi_fastpath(vcpu);

I have to admit this looks too much to me :-( Yes, I see the benefits of
taking a shortcut (by actually penalizing all other MSR writes) but the
question I have is: where do we stop?

Also, this 'shortcut' creates an imbalance in tracing: you don't go down
to kvm_emulate_wrmsr() so handle_ipi_fastpath() should probably gain a
tracepoint.

Looking at 'fast_vmexit' name makes me think this is something
generic. Is this so? Maybe we can create some sort of an infrastructure
for fast vmexit handling and make it easy to hook things up to it?

(Maybe it's just me who thinks the codebase is already too complex,
let's see what Paolo and other reviewers have to say).
Paolo Bonzini Nov. 11, 2019, 9:59 p.m. UTC | #2
On 09/11/19 08:05, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> This patch tries to optimize x2apic physical destination mode, fixed delivery
> mode single target IPI by delivering IPI to receiver immediately after sender
> writes ICR vmexit to avoid various checks when possible.
> 
> Testing on Xeon Skylake server:
> 
> The virtual IPI latency from sender send to receiver receive reduces more than
> 330+ cpu cycles.
> 
> Running hackbench(reschedule ipi) in the guest, the avg handle time of MSR_WRITE
> caused vmexit reduces more than 1000+ cpu cycles:
> 
> Before patch:
> 
>   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> MSR_WRITE    5417390    90.01%    16.31%      0.69us    159.60us    1.08us
> 
> After patch:
> 
>   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> MSR_WRITE    6726109    90.73%    62.18%      0.48us    191.27us    0.58us

Do you have retpolines enabled?  The bulk of the speedup might come just
from the indirect jump.

Paolo
Wanpeng Li Nov. 12, 2019, 1:18 a.m. UTC | #3
On Mon, 11 Nov 2019 at 21:06, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> Wanpeng Li <kernellwp@gmail.com> writes:
>
> > +
> >  static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  {
> >       struct vcpu_vmx *vmx = to_vmx(vcpu);
> > @@ -6615,6 +6645,12 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >                                 | (1 << VCPU_EXREG_CR3));
> >       vcpu->arch.regs_dirty = 0;
> >
> > +     vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
> > +     vcpu->fast_vmexit = false;
> > +     if (!is_guest_mode(vcpu) &&
> > +             vmx->exit_reason == EXIT_REASON_MSR_WRITE)
> > +             vcpu->fast_vmexit = handle_ipi_fastpath(vcpu);
>
> I have to admit this looks too much to me :-( Yes, I see the benefits of
> taking a shortcut (by actualy penalizing all other MSR writes) but the
> question I have is: where do we stop?

In our IaaS environment we observe that ICR and TSCDEADLINE are the main
MSR-write vmexits.

Before patch:
tscdeadline_immed 3900
tscdeadline 5413

After patch:
tscdeadline_immed 3912
tscdeadline 5427

So the penalty can be tolerated.

>
> Also, this 'shortcut' creates an imbalance in tracing: you don't go down
> to kvm_emulate_wrmsr() so handle_ipi_fastpath() should probably gain a
> tracepoint.

Agreed.

>
> Looking at 'fast_vmexit' name makes me think this is something
> generic. Is this so? Maybe we can create some sort of an infrastructure
> for fast vmexit handling and make it easy to hook things up to it?

Maybe an indirect jump? I can give it a try.

    Wanpeng
Wanpeng Li Nov. 12, 2019, 1:33 a.m. UTC | #4
On Tue, 12 Nov 2019 at 05:59, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 09/11/19 08:05, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > This patch tries to optimize x2apic physical destination mode, fixed delivery
> > mode single target IPI by delivering IPI to receiver immediately after sender
> > writes ICR vmexit to avoid various checks when possible.
> >
> > Testing on Xeon Skylake server:
> >
> > The virtual IPI latency from sender send to receiver receive reduces more than
> > 330+ cpu cycles.
> >
> > Running hackbench(reschedule ipi) in the guest, the avg handle time of MSR_WRITE
> > caused vmexit reduces more than 1000+ cpu cycles:
> >
> > Before patch:
> >
> >   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> > MSR_WRITE    5417390    90.01%    16.31%      0.69us    159.60us    1.08us
> >
> > After patch:
> >
> >   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> > MSR_WRITE    6726109    90.73%    62.18%      0.48us    191.27us    0.58us
>
> Do you have retpolines enabled?  The bulk of the speedup might come just
> from the indirect jump.

Adding 'mitigations=off' to the host kernel boot parameters (grub):

Before patch:

    VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
MSR_WRITE    2681713    92.98%    77.52%      0.38us     18.54us    0.73us ( +-   0.02% )

After patch:

    VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
MSR_WRITE    2953447    92.48%    62.47%      0.30us     59.09us    0.40us ( +-   0.02% )

Actually, this is not the first attempt to add a shortcut for
performance-sensitive MSR writes; an earlier effort is the tscdeadline
timer series from Isaku Yamahata, https://patchwork.kernel.org/cover/10541035/ .
ICR and TSCDEADLINE MSR writes cause the main MSR-write vmexits in our
product observations, and multicast IPIs are not as common as unicast IPIs
like RESCHEDULE_VECTOR and CALL_FUNCTION_SINGLE_VECTOR etc. As far as
I know, something similar to this patch has already been deployed in
some cloud companies' private kvm forks.

    Wanpeng
Wanpeng Li Nov. 13, 2019, 6:05 a.m. UTC | #5
On Tue, 12 Nov 2019 at 09:33, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> On Tue, 12 Nov 2019 at 05:59, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 09/11/19 08:05, Wanpeng Li wrote:
> > > From: Wanpeng Li <wanpengli@tencent.com>
> > >
> > > This patch tries to optimize x2apic physical destination mode, fixed delivery
> > > mode single target IPI by delivering IPI to receiver immediately after sender
> > > writes ICR vmexit to avoid various checks when possible.
> > >
> > > Testing on Xeon Skylake server:
> > >
> > > The virtual IPI latency from sender send to receiver receive reduces more than
> > > 330+ cpu cycles.
> > >
> > > Running hackbench(reschedule ipi) in the guest, the avg handle time of MSR_WRITE
> > > caused vmexit reduces more than 1000+ cpu cycles:
> > >
> > > Before patch:
> > >
> > >   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> > > MSR_WRITE    5417390    90.01%    16.31%      0.69us    159.60us    1.08us
> > >
> > > After patch:
> > >
> > >   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> > > MSR_WRITE    6726109    90.73%    62.18%      0.48us    191.27us    0.58us
> >
> > Do you have retpolines enabled?  The bulk of the speedup might come just
> > from the indirect jump.
>
> Adding 'mitigations=off' to the host grub parameter:
>
> Before patch:
>
>     VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> MSR_WRITE    2681713    92.98%    77.52%      0.38us     18.54us
> 0.73us ( +-   0.02% )
>
> After patch:
>
>     VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> MSR_WRITE    2953447    92.48%    62.47%      0.30us     59.09us
> 0.40us ( +-   0.02% )
>
> Actually, this is not the first attempt to add shortcut for MSR writes
> which performance sensitive, the other effort is tscdeadline timer
> from Isaku Yamahata, https://patchwork.kernel.org/cover/10541035/ ,
> ICR and TSCDEADLINE MSR writes cause the main MSR write vmexits in our
> product observation, multicast IPIs are not as common as unicast IPI
> like RESCHEDULE_VECTOR and CALL_FUNCTION_SINGLE_VECTOR etc. As far as
> I know, something similar to this patch has already been deployed in
> some cloud companies private kvm fork.

Hi Paolo,

Do you think I should continue for this?

    Wanpeng
Wanpeng Li Nov. 14, 2019, 3:12 a.m. UTC | #6
On Tue, 12 Nov 2019 at 09:33, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> On Tue, 12 Nov 2019 at 05:59, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 09/11/19 08:05, Wanpeng Li wrote:
> > > From: Wanpeng Li <wanpengli@tencent.com>
> > >
> > > This patch tries to optimize x2apic physical destination mode, fixed delivery
> > > mode single target IPI by delivering IPI to receiver immediately after sender
> > > writes ICR vmexit to avoid various checks when possible.
> > >
> > > Testing on Xeon Skylake server:
> > >
> > > The virtual IPI latency from sender send to receiver receive reduces more than
> > > 330+ cpu cycles.
> > >
> > > Running hackbench(reschedule ipi) in the guest, the avg handle time of MSR_WRITE
> > > caused vmexit reduces more than 1000+ cpu cycles:
> > >
> > > Before patch:
> > >
> > >   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> > > MSR_WRITE    5417390    90.01%    16.31%      0.69us    159.60us    1.08us
> > >
> > > After patch:
> > >
> > >   VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> > > MSR_WRITE    6726109    90.73%    62.18%      0.48us    191.27us    0.58us
> >
> > Do you have retpolines enabled?  The bulk of the speedup might come just
> > from the indirect jump.
>
> Adding 'mitigations=off' to the host grub parameter:
>
> Before patch:
>
>     VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> MSR_WRITE    2681713    92.98%    77.52%      0.38us     18.54us
> 0.73us ( +-   0.02% )
>
> After patch:
>
>     VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time   Avg time
> MSR_WRITE    2953447    92.48%    62.47%      0.30us     59.09us
> 0.40us ( +-   0.02% )

Hmm, the smaller per-vmexit time on the sender side is because the kvm_exit
tracepoint is still left in vmx_handle_exit while the ICR wrmsr handling is
moved ahead of it, which is why the time between the kvm_exit and kvm_entry
tracepoints is reduced.  But the virtual IPI latency is still reduced by 330+ cycles.

    Wanpeng
Paolo Bonzini Nov. 14, 2019, 11:58 a.m. UTC | #7
Ok, it's not _so_ ugly after all.

> ---
>  arch/x86/kvm/vmx/vmx.c   | 39 +++++++++++++++++++++++++++++++++++++--
>  include/linux/kvm_host.h |  1 +
>  2 files changed, 38 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5d21a4a..5c67061 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5924,7 +5924,9 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>  		}
>  	}
>  
> -	if (exit_reason < kvm_vmx_max_exit_handlers
> +	if (vcpu->fast_vmexit)
> +		return 1;
> +	else if (exit_reason < kvm_vmx_max_exit_handlers

Instead of a separate vcpu->fast_vmexit, perhaps you can set
vmx->exit_reason to -1 if the fast path succeeds.

> +			if (ret == 0)
> +				ret = kvm_skip_emulated_instruction(vcpu);

Please move the "kvm_skip_emulated_instruction(vcpu)" to
vmx_handle_exit, so that this basically is

#define EXIT_REASON_NEED_SKIP_EMULATED_INSN -1

	if (ret == 0)
		vcpu->exit_reason = EXIT_REASON_NEED_SKIP_EMULATED_INSN;

and handle_ipi_fastpath can return void.

Thanks,

Paolo

> +		};
> +	};
> +
> +	return ret;
> +}
> +
>  static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -6615,6 +6645,12 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  				  | (1 << VCPU_EXREG_CR3));
>  	vcpu->arch.regs_dirty = 0;
>  
> +	vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
> +	vcpu->fast_vmexit = false;
> +	if (!is_guest_mode(vcpu) &&
> +		vmx->exit_reason == EXIT_REASON_MSR_WRITE)
> +		vcpu->fast_vmexit = handle_ipi_fastpath(vcpu);

This should be done later, at least after kvm_put_guest_xcr0, because
running with partially-loaded guest state is harder to audit.  The best
place to put it actually is right after the existing vmx->exit_reason
assignment, where we already handle EXIT_REASON_MCE_DURING_VMENTRY.

>  	pt_guest_exit(vmx);
>  
>  	/*
> @@ -6634,7 +6670,6 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	vmx->nested.nested_run_pending = 0;
>  	vmx->idt_vectoring_info = 0;
>  
> -	vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
>  	if ((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
>  		kvm_machine_check();
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 719fc3e..7a7358b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -319,6 +319,7 @@ struct kvm_vcpu {
>  #endif
>  	bool preempted;
>  	bool ready;
> +	bool fast_vmexit;
>  	struct kvm_vcpu_arch arch;
>  	struct dentry *debugfs_dentry;
>  };
>
Sean Christopherson Nov. 14, 2019, 3:22 p.m. UTC | #8
On Thu, Nov 14, 2019 at 12:58:56PM +0100, Paolo Bonzini wrote:
> Ok, it's not _so_ ugly after all.
> 
> > ---
> >  arch/x86/kvm/vmx/vmx.c   | 39 +++++++++++++++++++++++++++++++++++++--
> >  include/linux/kvm_host.h |  1 +
> >  2 files changed, 38 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 5d21a4a..5c67061 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -5924,7 +5924,9 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
> >  		}
> >  	}
> >  
> > -	if (exit_reason < kvm_vmx_max_exit_handlers
> > +	if (vcpu->fast_vmexit)
> > +		return 1;
> > +	else if (exit_reason < kvm_vmx_max_exit_handlers
> 
> Instead of a separate vcpu->fast_vmexit, perhaps you can set exit_reason
> to vmx->exit_reason to -1 if the fast path succeeds.

Actually, rather than make this super special case, what about moving the
handling into vmx_handle_exit_irqoff()?  Practically speaking that would
only add ~50 cycles (two VMREADs) relative to the code being run right
after kvm_put_guest_xcr0().  It has the advantage of restoring the host's
hardware breakpoints, preserving a semi-accurate last_guest_tsc, and
running with vcpu->mode set back to OUTSIDE_GUEST_MODE.  Hopefully it'd
also be more intuitive for people unfamiliar with the code.

> 

> > +			if (ret == 0)
> > +				ret = kvm_skip_emulated_instruction(vcpu);
> 
> Please move the "kvm_skip_emulated_instruction(vcpu)" to
> vmx_handle_exit, so that this basically is
> 
> #define EXIT_REASON_NEED_SKIP_EMULATED_INSN -1
> 
> 	if (ret == 0)
> 		vcpu->exit_reason = EXIT_REASON_NEED_SKIP_EMULATED_INSN;
> 
> and handle_ipi_fastpath can return void.

I'd rather we add a dedicated variable to say the exit has already been
handled.  Overloading exit_reason is bound to cause confusion, and that's
probably a best case scenario.

> > +		};
> > +	};
> > +
> > +	return ret;
> > +}
> > +
> >  static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  {
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > @@ -6615,6 +6645,12 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  				  | (1 << VCPU_EXREG_CR3));
> >  	vcpu->arch.regs_dirty = 0;
> >  
> > +	vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
> > +	vcpu->fast_vmexit = false;
> > +	if (!is_guest_mode(vcpu) &&
> > +		vmx->exit_reason == EXIT_REASON_MSR_WRITE)
> > +		vcpu->fast_vmexit = handle_ipi_fastpath(vcpu);
> 
> This should be done later, at least after kvm_put_guest_xcr0, because
> running with partially-loaded guest state is harder to audit.  The best
> place to put it actually is right after the existing vmx->exit_reason
> assignment, where we already handle EXIT_REASON_MCE_DURING_VMENTRY.
> 
> >  	pt_guest_exit(vmx);
> >  
> >  	/*
> > @@ -6634,7 +6670,6 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  	vmx->nested.nested_run_pending = 0;
> >  	vmx->idt_vectoring_info = 0;
> >  
> > -	vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
> >  	if ((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
> >  		kvm_machine_check();
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 719fc3e..7a7358b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -319,6 +319,7 @@ struct kvm_vcpu {
> >  #endif
> >  	bool preempted;
> >  	bool ready;
> > +	bool fast_vmexit;
> >  	struct kvm_vcpu_arch arch;
> >  	struct dentry *debugfs_dentry;
> >  };
> > 
>
Paolo Bonzini Nov. 14, 2019, 3:44 p.m. UTC | #9
On 14/11/19 16:22, Sean Christopherson wrote:
>> Instead of a separate vcpu->fast_vmexit, perhaps you can set exit_reason
>> to vmx->exit_reason to -1 if the fast path succeeds.
> 
> Actually, rather than make this super special case, what about moving the
> handling into vmx_handle_exit_irqoff()?  Practically speaking that would
> only add ~50 cycles (two VMREADs) relative to the code being run right
> after kvm_put_guest_xcr0().  It has the advantage of restoring the host's
> hardware breakpoints, preserving a semi-accurate last_guest_tsc, and
> running with vcpu->mode set back to OUTSIDE_GUEST_MODE.  Hopefully it'd
> also be more intuitive for people unfamiliar with the code.

Yes, that's a good idea.  The expensive bit between handle_exit_irqoff
and handle_exit is srcu_read_lock, which has two memory barriers in it.


>>> +			if (ret == 0)
>>> +				ret = kvm_skip_emulated_instruction(vcpu);
>> Please move the "kvm_skip_emulated_instruction(vcpu)" to
>> vmx_handle_exit, so that this basically is
>>
>> #define EXIT_REASON_NEED_SKIP_EMULATED_INSN -1
>>
>> 	if (ret == 0)
>> 		vcpu->exit_reason = EXIT_REASON_NEED_SKIP_EMULATED_INSN;
>>
>> and handle_ipi_fastpath can return void.
>
> I'd rather we add a dedicated variable to say the exit has already been
> handled.  Overloading exit_reason is bound to cause confusion, and that's
> probably a best case scenario.

I proposed the fake exit reason to avoid a ternary return code from
handle_ipi_fastpath (return to guest, return to userspace, call
kvm_x86_ops->handle_exit), which Wanpeng's patch was mishandling.

To ensure confusion does not become the best case scenario, perhaps it
is worth trying to push exit_reason into vcpu_enter_guest's stack.
vcpu_enter_guest can pass a pointer to it, and then it can be passed
back into kvm_x86_ops->handle_exit{,_irqoff}.  It could be a struct too,
instead of just a bare u32.

This would ensure at compile-time that exit_reason is not accessed
outside the short path from vmexit to kvm_x86_ops->handle_exit.
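
Purely as an illustration of that shape (names invented here, not from any
posted patch), it could look roughly like:

	/* per-exit state kept on vcpu_enter_guest()'s stack, illustrative only */
	struct kvm_vmexit_info {
		u32 exit_reason;
		bool insn_needs_skip;
	};

	/* in vcpu_enter_guest() */
	struct kvm_vmexit_info exit_info = { };

	kvm_x86_ops->handle_exit_irqoff(vcpu, &exit_info);
	...
	r = kvm_x86_ops->handle_exit(vcpu, &exit_info);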

Paolo
Sean Christopherson Nov. 14, 2019, 4:34 p.m. UTC | #10
On Thu, Nov 14, 2019 at 04:44:33PM +0100, Paolo Bonzini wrote:
> On 14/11/19 16:22, Sean Christopherson wrote:
> >> Instead of a separate vcpu->fast_vmexit, perhaps you can set exit_reason
> >> to vmx->exit_reason to -1 if the fast path succeeds.
> > 
> > Actually, rather than make this super special case, what about moving the
> > handling into vmx_handle_exit_irqoff()?  Practically speaking that would
> > only add ~50 cycles (two VMREADs) relative to the code being run right
> > after kvm_put_guest_xcr0().  It has the advantage of restoring the host's
> > hardware breakpoints, preserving a semi-accurate last_guest_tsc, and
> > running with vcpu->mode set back to OUTSIDE_GUEST_MODE.  Hopefully it'd
> > also be more intuitive for people unfamiliar with the code.
> 
> Yes, that's a good idea.  The expensive bit between handle_exit_irqoff
> and handle_exit is srcu_read_lock, which has two memory barriers in it.

Preaching to the choir at this point, but it'd also eliminate latency
spikes due to interrupts.

> >>> +			if (ret == 0)
> >>> +				ret = kvm_skip_emulated_instruction(vcpu);
> >> Please move the "kvm_skip_emulated_instruction(vcpu)" to
> >> vmx_handle_exit, so that this basically is
> >>
> >> #define EXIT_REASON_NEED_SKIP_EMULATED_INSN -1
> >>
> >> 	if (ret == 0)
> >> 		vcpu->exit_reason = EXIT_REASON_NEED_SKIP_EMULATED_INSN;
> >>
> >> and handle_ipi_fastpath can return void.
> >
> > I'd rather we add a dedicated variable to say the exit has already been
> > handled.  Overloading exit_reason is bound to cause confusion, and that's
> > probably a best case scenario.
> 
> I proposed the fake exit reason to avoid a ternary return code from
> handle_ipi_fastpath (return to guest, return to userspace, call
> kvm_x86_ops->handle_exit), which Wanpeng's patch was mishandling.

For this case, I think we can get away with a WARN if kvm_lapic_reg_write()
fails since it (currently) can't fail for ICR.  That would allow us to keep
a void return for ->handle_exit_irqoff() and avoid an overloaded return
value.

And, if the fastpath is used for all ICR writes, not just FIXED+PHYSICAL,
and is implemented for SVM, then we don't even need a flag, e.g.
kvm_x2apic_msr_write() can simply ignore ICR writes, similar to how
handle_exception() ignores #MC and NMIs.

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 87b0fcc23ef8..d7b79f7faac1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2615,12 +2615,11 @@ int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
        if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic))
                return 1;

-       if (reg == APIC_ICR2)
+
+       /* ICR2 writes are ignored and ICR writes are handled early. */
+       if (reg == APIC_ICR2 || reg == APIC_ICR)
                return 1;

-       /* if this is ICR write vector before command */
-       if (reg == APIC_ICR)
-               kvm_lapic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
        return kvm_lapic_reg_write(apic, reg, (u32)data);
 }

Another bonus to this approach is that the refactoring for the
exit_reason can be done in a separate series.

> To ensure confusion does not become the best case scenario, perhaps it
> is worth trying to push exit_reason into vcpu_enter_guest's stack.
> vcpu_enter_guest can pass a pointer to it, and then it can be passed
> back into kvm_x86_ops->handle_exit{,_irqoff}.  It could be a struct too,
> instead of just a bare u32.

On the other hand, if it's a bare u32 then kvm_x86_ops->run can simply
return the value instead of doing out parameter shenanigans.
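
Roughly (sketch only, with invented signatures for illustration):

	u32 (*run)(struct kvm_vcpu *vcpu);	/* returns the exit reason */
	int (*handle_exit)(struct kvm_vcpu *vcpu, u32 exit_reason);

	/* in vcpu_enter_guest() */
	exit_reason = kvm_x86_ops->run(vcpu);
	...
	r = kvm_x86_ops->handle_exit(vcpu, exit_reason);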

> This would ensure at compile-time that exit_reason is not accessed
> outside the short path from vmexit to kvm_x86_ops->handle_exit.

That would be nice.  Surprisingly, the code actually appears to be fairly
clean, I thought for sure the nested stuff would be using exit_reason all
over the place.  The only one that needs to be fixed is handle_vmfunc(),
which passes vmx->exit_reason when forwarding the VM-Exit instead of simply
hardcoding EXIT_REASON_VMFUNC.
Wanpeng Li Nov. 15, 2019, 5:56 a.m. UTC | #11
On Fri, 15 Nov 2019 at 00:34, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Nov 14, 2019 at 04:44:33PM +0100, Paolo Bonzini wrote:
> > On 14/11/19 16:22, Sean Christopherson wrote:
> > >> Instead of a separate vcpu->fast_vmexit, perhaps you can set exit_reason
> > >> to vmx->exit_reason to -1 if the fast path succeeds.
> > >
> > > Actually, rather than make this super special case, what about moving the
> > > handling into vmx_handle_exit_irqoff()?  Practically speaking that would
> > > only add ~50 cycles (two VMREADs) relative to the code being run right
> > > after kvm_put_guest_xcr0().  It has the advantage of restoring the host's
> > > hardware breakpoints, preserving a semi-accurate last_guest_tsc, and
> > > running with vcpu->mode set back to OUTSIDE_GUEST_MODE.  Hopefully it'd
> > > also be more intuitive for people unfamiliar with the code.
> >
> > Yes, that's a good idea.  The expensive bit between handle_exit_irqoff
> > and handle_exit is srcu_read_lock, which has two memory barriers in it.

Moving the handling into vmx_handle_exit_irqoff() is ~100 cycles worse
than handling it right after kvm_put_guest_xcr0() in my testing. So,
which one do you prefer?

For moving the handling into vmx_handle_exit_irqoff(), how about the
patch below:

---8<---

From 1b20b0a80c6da18b721f125cc40f8be5ad31f4b1 Mon Sep 17 00:00:00 2001
From: Wanpeng Li <wanpengli@tencent.com>
Date: Fri, 15 Nov 2019 10:18:09 +0800
Subject: [PATCH v2] KVM: VMX: FIXED+PHYSICAL mode single target IPI fastpath

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/include/uapi/asm/vmx.h |  1 +
 arch/x86/kvm/svm.c              |  4 ++--
 arch/x86/kvm/vmx/vmx.c          | 40 +++++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |  5 +++--
 5 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 24d6598..3604f3a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1067,7 +1067,7 @@ struct kvm_x86_ops {
     void (*tlb_flush_gva)(struct kvm_vcpu *vcpu, gva_t addr);

     void (*run)(struct kvm_vcpu *vcpu);
-    int (*handle_exit)(struct kvm_vcpu *vcpu);
+    int (*handle_exit)(struct kvm_vcpu *vcpu, u32 *vcpu_exit_reason);
     int (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
     void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
     u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu);
@@ -1117,7 +1117,7 @@ struct kvm_x86_ops {
     int (*check_intercept)(struct kvm_vcpu *vcpu,
                    struct x86_instruction_info *info,
                    enum x86_intercept_stage stage);
-    void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu);
+    void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu, u32 *vcpu_exit_reason);
     bool (*mpx_supported)(void);
     bool (*xsaves_supported)(void);
     bool (*umip_emulated)(void);
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 3eb8411..b33c6e1 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -88,6 +88,7 @@
 #define EXIT_REASON_XRSTORS             64
 #define EXIT_REASON_UMWAIT              67
 #define EXIT_REASON_TPAUSE              68
+#define EXIT_REASON_NEED_SKIP_EMULATED_INSN -1

 #define VMX_EXIT_REASONS \
     { EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c5673bd..3bb0661 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4927,7 +4927,7 @@ static void svm_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
     *info2 = control->exit_info_2;
 }

-static int handle_exit(struct kvm_vcpu *vcpu)
+static int handle_exit(struct kvm_vcpu *vcpu, u32 *vcpu_exit_reason)
 {
     struct vcpu_svm *svm = to_svm(vcpu);
     struct kvm_run *kvm_run = vcpu->run;
@@ -6171,7 +6171,7 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
     return ret;
 }

-static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu, u32 *vcpu_exit_reason)
 {

 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5d21a4a..073fe1f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5838,7 +5838,7 @@ void dump_vmcs(void)
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
  */
-static int vmx_handle_exit(struct kvm_vcpu *vcpu)
+static int vmx_handle_exit(struct kvm_vcpu *vcpu, u32 *vcpu_exit_reason)
 {
     struct vcpu_vmx *vmx = to_vmx(vcpu);
     u32 exit_reason = vmx->exit_reason;
@@ -5924,7 +5924,10 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
         }
     }

-    if (exit_reason < kvm_vmx_max_exit_handlers
+    if (*vcpu_exit_reason == EXIT_REASON_NEED_SKIP_EMULATED_INSN) {
+        kvm_skip_emulated_instruction(vcpu);
+        return 1;
+    } else if (exit_reason < kvm_vmx_max_exit_handlers
         && kvm_vmx_exit_handlers[exit_reason])
         return kvm_vmx_exit_handlers[exit_reason](vcpu);
     else {
@@ -6255,7 +6258,36 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 }
 STACK_FRAME_NON_STANDARD(handle_external_interrupt_irqoff);

-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+static u32 handle_ipi_fastpath(struct kvm_vcpu *vcpu)
+{
+    u32 index;
+    u64 data;
+    int ret = 0;
+
+    if (lapic_in_kernel(vcpu) && apic_x2apic_mode(vcpu->arch.apic)) {
+        /*
+         * fastpath to IPI target, FIXED+PHYSICAL which is popular
+         */
+        index = kvm_rcx_read(vcpu);
+        data = kvm_read_edx_eax(vcpu);
+
+        if (((index - APIC_BASE_MSR) << 4 == APIC_ICR) &&
+            ((data & KVM_APIC_DEST_MASK) == APIC_DEST_PHYSICAL) &&
+            ((data & APIC_MODE_MASK) == APIC_DM_FIXED)) {
+
+            trace_kvm_msr_write(index, data);
+            kvm_lapic_set_reg(vcpu->arch.apic, APIC_ICR2, (u32)(data >> 32));
+            ret = kvm_lapic_reg_write(vcpu->arch.apic, APIC_ICR, (u32)data);
+
+            if (ret == 0)
+                return EXIT_REASON_NEED_SKIP_EMULATED_INSN;
+        }
+    }
+
+    return ret;
+}
+
+static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu, u32 *exit_reason)
 {
     struct vcpu_vmx *vmx = to_vmx(vcpu);

@@ -6263,6 +6295,8 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
         handle_external_interrupt_irqoff(vcpu);
     else if (vmx->exit_reason == EXIT_REASON_EXCEPTION_NMI)
         handle_exception_nmi_irqoff(vmx);
+    else if (vmx->exit_reason == EXIT_REASON_MSR_WRITE)
+        *exit_reason = handle_ipi_fastpath(vcpu);
 }

 static bool vmx_has_emulated_msr(int index)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff395f8..130576b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7931,6 +7931,7 @@ EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
 static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 {
     int r;
+    u32 exit_reason = 0;
     bool req_int_win =
         dm_request_for_irq_injection(vcpu) &&
         kvm_cpu_accept_dm_intr(vcpu);
@@ -8180,7 +8181,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
     vcpu->mode = OUTSIDE_GUEST_MODE;
     smp_wmb();

-    kvm_x86_ops->handle_exit_irqoff(vcpu);
+    kvm_x86_ops->handle_exit_irqoff(vcpu, &exit_reason);

     /*
      * Consume any pending interrupts, including the possible source of
@@ -8224,7 +8225,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
         kvm_lapic_sync_from_vapic(vcpu);

     vcpu->arch.gpa_available = false;
-    r = kvm_x86_ops->handle_exit(vcpu);
+    r = kvm_x86_ops->handle_exit(vcpu, &exit_reason);
     return r;

 cancel_injection:
--
2.7.4

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5d21a4a..5c67061 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5924,7 +5924,9 @@  static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 		}
 	}
 
-	if (exit_reason < kvm_vmx_max_exit_handlers
+	if (vcpu->fast_vmexit)
+		return 1;
+	else if (exit_reason < kvm_vmx_max_exit_handlers
 	    && kvm_vmx_exit_handlers[exit_reason])
 		return kvm_vmx_exit_handlers[exit_reason](vcpu);
 	else {
@@ -6474,6 +6476,34 @@  void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
 
 bool __vmx_vcpu_run(struct vcpu_vmx *vmx, unsigned long *regs, bool launched);
 
+static int handle_ipi_fastpath(struct kvm_vcpu *vcpu)
+{
+	u32 index;
+	u64 data;
+	int ret = 0;
+
+	if (lapic_in_kernel(vcpu) && apic_x2apic_mode(vcpu->arch.apic)) {
+		/*
+		 * fastpath to IPI target
+		 */
+		index = kvm_rcx_read(vcpu);
+		data = kvm_read_edx_eax(vcpu);
+
+		if (((index - APIC_BASE_MSR) << 4 == APIC_ICR) &&
+			((data & KVM_APIC_DEST_MASK) == APIC_DEST_PHYSICAL) &&
+			((data & APIC_MODE_MASK) == APIC_DM_FIXED)) {
+
+			kvm_lapic_set_reg(vcpu->arch.apic, APIC_ICR2, (u32)(data >> 32));
+			ret = kvm_lapic_reg_write(vcpu->arch.apic, APIC_ICR, (u32)data);
+
+			if (ret == 0)
+				ret = kvm_skip_emulated_instruction(vcpu);
+		};
+	};
+
+	return ret;
+}
+
 static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6615,6 +6645,12 @@  static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 				  | (1 << VCPU_EXREG_CR3));
 	vcpu->arch.regs_dirty = 0;
 
+	vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
+	vcpu->fast_vmexit = false;
+	if (!is_guest_mode(vcpu) &&
+		vmx->exit_reason == EXIT_REASON_MSR_WRITE)
+		vcpu->fast_vmexit = handle_ipi_fastpath(vcpu);
+
 	pt_guest_exit(vmx);
 
 	/*
@@ -6634,7 +6670,6 @@  static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	vmx->nested.nested_run_pending = 0;
 	vmx->idt_vectoring_info = 0;
 
-	vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);
 	if ((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
 		kvm_machine_check();
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 719fc3e..7a7358b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -319,6 +319,7 @@  struct kvm_vcpu {
 #endif
 	bool preempted;
 	bool ready;
+	bool fast_vmexit;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
 };