[v2,0/6] x86/kernel/hyper-v: xmm fast hypercall

Message ID cover.1540441925.git.isaku.yamahata@gmail.com (mailing list archive)

Message

Isaku Yamahata Oct. 25, 2018, 4:48 a.m. UTC
This patch series implements the xmm fast hypercall for Hyper-V on the
guest side, plus the corresponding KVM support on the VMM side.
With this series, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE without a
gva list, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX (vcpu > 64) and
HVCALL_SEND_IPI_EX (vcpu > 64) can use the xmm fast hypercall.
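
(For illustration only -- a rough sketch of the guest-side idea, not the
exact code in this series. Per the TLFS, an xmm fast hypercall passes up
to 112 bytes of input in RDX, R8 and XMM0-XMM5 instead of pointing RDX at
a guest-physical input page. hv_xmm_fast_hypercall_available() and
hv_do_fast_hypercall_xmm() below are placeholder names.)

#include <asm/hyperv-tlfs.h>
#include <asm/mshyperv.h>

/*
 * Sketch: use the xmm fast variant when the whole input block fits in
 * the registers (RDX, R8, XMM0-XMM5 = 112 bytes), otherwise fall back
 * to the normal memory-based hypercall.
 */
static u64 hv_flush_ex_sketch(void *input, size_t input_len)
{
        if (hv_xmm_fast_hypercall_available() && input_len <= 112)
                return hv_do_fast_hypercall_xmm(
                        HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX,
                        input, input_len);

        /* hv_do_hypercall() passes the input block by guest physical address. */
        return hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX,
                               input, NULL);
}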

benchmark result:
At the moment my test machine has only pcpu=4, so this series does not
change the behaviour seen by the ipi benchmark. For now I measured the
time of hyperv_flush_tlb_others() with ktap while running
'hardinfo -r -f text'; the averages of 5 runs are as follows.
(Once a large machine with pcpu > 64 is available, the ipi_benchmark
result will be interesting, but not yet.)

hyperv_flush_tlb_others() time by hardinfo -r -f text:

with patch:      9931 ns
without patch:  11111 ns


With commit 4bd06060762b, __send_ipi_mask() already uses the fast
hypercall when possible, which covers the vcpu=4 case. So I used a
kernel from before that commit to measure the effect of the xmm fast
hypercall with ipi_benchmark. The following is the average of 100 runs.

ipi_benchmark: average of 100 runs without 4bd06060762b
(first column: IPI sending time, second column: total time including the
acknowledge; in ns)

with patch:
Dry-run                 0        495181
Self-IPI         11352737      21549999
Normal IPI      499400218     575433727
Broadcast IPI           0    1700692010
Broadcast lock          0    1663001374

without patch:
Dry-run                 0        607657
Self-IPI         10915950      21217644
Normal IPI      621712609     735015570
Broadcast IPI           0    2173803373
Broadcast lock          0    2150451543

Isaku Yamahata (6):
  x86/kernel/hyper-v: xmm fast hypercall as guest
  x86/hyperv: use hv_do_hypercall for __send_ipi_mask_ex()
  x86/hyperv: use hv_do_hypercall for flush_virtual_address_space_ex
  hyperv: use hv_do_hypercall instead of hv_do_fast_hypercall
  x86/kvm/hyperv: implement xmm fast hypercall
  local: hyperv: test ex hypercall

 arch/x86/hyperv/hv_apic.c           |  16 +++-
 arch/x86/hyperv/mmu.c               |  24 +++--
 arch/x86/hyperv/nested.c            |   2 +-
 arch/x86/include/asm/hyperv-tlfs.h  |   3 +
 arch/x86/include/asm/mshyperv.h     | 180 ++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/hyperv.c               | 101 ++++++++++++++++----
 drivers/hv/connection.c             |   3 +-
 drivers/hv/hv.c                     |   3 +-
 drivers/pci/controller/pci-hyperv.c |   7 +-
 9 files changed, 291 insertions(+), 48 deletions(-)
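
On the KVM side (patch 5), the idea is roughly the following. This is an
illustrative sketch with a hypothetical helper name
(kvm_hv_read_xmm_input()), not the code in the patch: when the guest sets
the fast bit on an xmm-capable call, the input block is rebuilt from the
guest's RDX/R8 and XMM0-XMM5 instead of being read from the GPA in RDX.

#include <linux/kvm_host.h>

static int kvm_hv_get_input(struct kvm_vcpu *vcpu, bool fast, u64 ingpa,
                            void *buf, size_t len)
{
        if (fast) {
                /* First 16 bytes from RDX/R8, the rest from XMM0-XMM5. */
                kvm_hv_read_xmm_input(vcpu, buf, len);  /* hypothetical */
                return 0;
        }
        /* Slow path: read the parameter block from guest memory. */
        return kvm_vcpu_read_guest(vcpu, ingpa, buf, len);
}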

Comments

Wanpeng Li Oct. 25, 2018, 5:02 a.m. UTC | #1
On Thu, 25 Oct 2018 at 05:50, Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> This patch series implements xmm fast hypercall for hyper-v as guest
> and kvm support as VMM.
> With this patch, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE without
> gva list, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX(vcpu > 64) and
> HVCALL_SEND_IPI_EX(vcpu > 64) can use xmm fast hypercall.
>
> benchmark result:
> At the moment, my test machine have only pcpu=4, ipi benchmark doesn't

Did you evaluate the xmm fast hypercall for PV IPIs? In addition, testing
on a large server would provide more convincing evidence.

Regards,
Wanpeng Li

Isaku Yamahata Oct. 25, 2018, 5:37 p.m. UTC | #2
On Thu, Oct 25, 2018 at 06:02:58AM +0100,
Wanpeng Li <kernellwp@gmail.com> wrote:

> On Thu, 25 Oct 2018 at 05:50, Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
> >
> > This patch series implements xmm fast hypercall for hyper-v as guest
> > and kvm support as VMM.
> > With this patch, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE without
> > gva list, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX(vcpu > 64) and
> > HVCALL_SEND_IPI_EX(vcpu > 64) can use xmm fast hypercall.
> >
> > benchmark result:
> > At the moment, my test machine have only pcpu=4, ipi benchmark doesn't
> 
> Did you evaluate xmm fast hypercall for pv ipis? In addition, testing
> on a large server can provide a forceful evidence.

Please see the results below, which were taken on a kernel without commit
4bd06060762b, so that the xmm fast hypercall is exercised even with
vcpu=4.

Right now I'm looking for a machine with pcpu > 64, but it may take a
while. I wanted to send out the patches early so that someone else can
test/benchmark them.

Thanks,

KY Srinivasan Oct. 26, 2018, 9:05 a.m. UTC | #3
We will test this as well.

K. Y

Roman Kagan Oct. 29, 2018, 6:22 p.m. UTC | #4
On Wed, Oct 24, 2018 at 09:48:25PM -0700, Isaku Yamahata wrote:
> This patch series implements xmm fast hypercall for hyper-v as guest
> and kvm support as VMM.

I think it may be a good idea to do it in separate patchsets.  They're
probably targeted at different maintainer trees (x86/hyperv vs kvm) and
the only thing they have in common is a couple of new defines in
hyperv-tlfs.h.  

> With this patch, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE without
> gva list, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX(vcpu > 64) and
> HVCALL_SEND_IPI_EX(vcpu > 64) can use xmm fast hypercall.
> 
> benchmark result:
> At the moment, my test machine have only pcpu=4, ipi benchmark doesn't
> make any behaviour change. So for now I measured the time of
> hyperv_flush_tlb_others() by ktap with 'hardinfo -r -f text'.

This suggests that the guest OS was Linux with your patches 1-4.  What
was the hypervisor?  KVM with your patch 5 or Hyper-V proper?

> the average of 5 runs are as follows.
> (When large machine with pcpu > 64 is avaialble, ipi_benchmark result is
> interesting. But not yet now.)

Are you referring to https://patchwork.kernel.org/patch/10122703/ ?
Has it landed anywhere in the tree?  I seem unable to find it...

> hyperv_flush_tlb_others() time by hardinfo -r -f text:
> 
> with path:       9931 ns
> without patch:  11111 ns
> 
> 
> With patch of 4bd06060762b, __send_ipi_mask() now uses fast hypercall
> when possible. so in the case of vcpu=4. So I used kernel before the parch
> to measure the effect of xmm fast hypercall with ipi_benchmark.
> The following is the average of 100 runs.
> 
> ipi_benchmark: average of 100 runs without 4bd06060762b
> 
> with patch:
> Dry-run                 0        495181
> Self-IPI         11352737      21549999
> Normal IPI      499400218     575433727
> Broadcast IPI           0    1700692010
> Broadcast lock          0    1663001374
> 
> without patch:
> Dry-run                 0        607657
> Self-IPI         10915950      21217644
> Normal IPI      621712609     735015570

This is about 122 ms difference in IPI sending time, and 160 ms in
total time, i.e. extra 38 ms for the acknowledge.  AFAICS the
acknowledge path should be exactly the same.  Any idea where these
additional 38 ms come from?

> Broadcast IPI           0    2173803373

This one is strange, too: the difference should only be on the sending
side, and there it should be basically constant with the number of cpus.
So I would expect the patched vs unpatched delta to be about the same as
for "Normal IPI".  Am I missing something?

> Broadcast lock          0    2150451543

Thanks,
Roman.
Isaku Yamahata Oct. 30, 2018, 2:43 a.m. UTC | #5
On Mon, Oct 29, 2018 at 06:22:14PM +0000,
Roman Kagan <rkagan@virtuozzo.com> wrote:

> On Wed, Oct 24, 2018 at 09:48:25PM -0700, Isaku Yamahata wrote:
> > This patch series implements xmm fast hypercall for hyper-v as guest
> > and kvm support as VMM.
> 
> I think it may be a good idea to do it in separate patchsets.  They're
> probably targeted at different maintainer trees (x86/hyperv vs kvm) and
> the only thing they have in common is a couple of new defines in
> hyperv-tlfs.h.  
> 
> > With this patch, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE without
> > gva list, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX(vcpu > 64) and
> > HVCALL_SEND_IPI_EX(vcpu > 64) can use xmm fast hypercall.
> > 
> > benchmark result:
> > At the moment, my test machine have only pcpu=4, ipi benchmark doesn't
> > make any behaviour change. So for now I measured the time of
> > hyperv_flush_tlb_others() by ktap with 'hardinfo -r -f text'.
> 
> This suggests that the guest OS was Linux with your patches 1-4.  What
> was the hypervisor?  KVM with your patch 5 or Hyper-V proper?

For patches 1-4, it's Hyper-V.
For patch 5, it's KVM with Hyper-V hypercall support.

I'll split this patch series to avoid confusion.


> 
> > the average of 5 runs are as follows.
> > (When large machine with pcpu > 64 is avaialble, ipi_benchmark result is
> > interesting. But not yet now.)
> 
> Are you referring to https://patchwork.kernel.org/patch/10122703/ ?
> Has it landed anywhere in the tree?  I seem unable to find it...

Yes, that patch. It's not merged yet.


> 
> > hyperv_flush_tlb_others() time by hardinfo -r -f text:
> > 
> > with path:       9931 ns
> > without patch:  11111 ns
> > 
> > 
> > With patch of 4bd06060762b, __send_ipi_mask() now uses fast hypercall
> > when possible. so in the case of vcpu=4. So I used kernel before the parch
> > to measure the effect of xmm fast hypercall with ipi_benchmark.
> > The following is the average of 100 runs.
> > 
> > ipi_benchmark: average of 100 runs without 4bd06060762b
> > 
> > with patch:
> > Dry-run                 0        495181
> > Self-IPI         11352737      21549999
> > Normal IPI      499400218     575433727
> > Broadcast IPI           0    1700692010
> > Broadcast lock          0    1663001374
> > 
> > without patch:
> > Dry-run                 0        607657
> > Self-IPI         10915950      21217644
> > Normal IPI      621712609     735015570
> 
> This is about 122 ms difference in IPI sending time, and 160 ms in
> total time, i.e. extra 38 ms for the acknowledge.  AFAICS the
> acknowledge path should be exactly the same.  Any idea where these
> additional 38 ms come from?
> 
> > Broadcast IPI           0    2173803373
> 
> This one is strange, too: the difference should only be on the sending
> side, and there it should be basically constant with the number of cpus.
> So I would expect the patched vs unpatched delta to be about the same as
> for "Normal IPI".  Am I missing something?

The results seem very sensitive to host activity and so are unstable
(pcpu=vcpu=4 in the benchmark).
Since the benchmark should be run on a large machine (vcpu > 64) anyway,
I didn't dig further.

Thanks,

Roman Kagan Oct. 30, 2018, 7:19 a.m. UTC | #6
On Mon, Oct 29, 2018 at 07:43:19PM -0700, Isaku Yamahata wrote:
> On Mon, Oct 29, 2018 at 06:22:14PM +0000,
> Roman Kagan <rkagan@virtuozzo.com> wrote:
> > On Wed, Oct 24, 2018 at 09:48:25PM -0700, Isaku Yamahata wrote:
> > > With this patch, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE without
> > > gva list, HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX(vcpu > 64) and
> > > HVCALL_SEND_IPI_EX(vcpu > 64) can use xmm fast hypercall.
> > > 
> > > benchmark result:
> > > At the moment, my test machine have only pcpu=4, ipi benchmark doesn't
> > > make any behaviour change. So for now I measured the time of
> > > hyperv_flush_tlb_others() by ktap with 'hardinfo -r -f text'.
> > 
> > This suggests that the guest OS was Linux with your patches 1-4.  What
> > was the hypervisor?  KVM with your patch 5 or Hyper-V proper?
> 
> For patch 1-4, it's hyper-v.
> For patch 5, it's kvm with hyper-v hypercall support.

So you have two result sets?  Which one was in your post?

It would also be interesting to run some IPI- or TLB-flush-sensitive
benchmark with a Windows guest.

> > > hyperv_flush_tlb_others() time by hardinfo -r -f text:
> > > 
> > > with path:       9931 ns
> > > without patch:  11111 ns
> > > 
> > > 
> > > With patch of 4bd06060762b, __send_ipi_mask() now uses fast hypercall
> > > when possible. so in the case of vcpu=4. So I used kernel before the parch
> > > to measure the effect of xmm fast hypercall with ipi_benchmark.
> > > The following is the average of 100 runs.
> > > 
> > > ipi_benchmark: average of 100 runs without 4bd06060762b
> > > 
> > > with patch:
> > > Dry-run                 0        495181
> > > Self-IPI         11352737      21549999
> > > Normal IPI      499400218     575433727
> > > Broadcast IPI           0    1700692010
> > > Broadcast lock          0    1663001374
> > > 
> > > without patch:
> > > Dry-run                 0        607657
> > > Self-IPI         10915950      21217644
> > > Normal IPI      621712609     735015570
> > 
> > This is about 122 ms difference in IPI sending time, and 160 ms in
> > total time, i.e. extra 38 ms for the acknowledge.  AFAICS the
> > acknowledge path should be exactly the same.  Any idea where these
> > additional 38 ms come from?
> > 
> > > Broadcast IPI           0    2173803373
> > 
> > This one is strange, too: the difference should only be on the sending
> > side, and there it should be basically constant with the number of cpus.
> > So I would expect the patched vs unpatched delta to be about the same as
> > for "Normal IPI".  Am I missing something?
> 
> The result seems very sensitive to host activity and so is unstable.
> (pcpu=vcpu=4 in the benchmark.)
> Since the benchmark should be on large machine(vcpu>64) anyway,

IMO the bigger the vcpu set you want to pass in the hypercall, the less
competitive the xmm fast version is.

I think realistically every implementation of xmm fast both in the guest
and in the hypervisor will actually use the parameter block in memory
(and so does yours), so the difference between xmm fast and regular
hypercalls is the cost of loading/storing the parameter block to/from
xmm (+ preserving the task fpu state) in the guest vs mapping the
parameter block in the hypervisor.  The latter is constant (per spec the
parameter block can't cross page boundaries so it's always one page
exactly), the former grows with the size of the parameter block.
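
In code terms the guest-side cost I mean looks roughly like this
(placeholder helper names, just to illustrate the point):

#include <asm/fpu/api.h>

static u64 xmm_fast_cost_sketch(u64 control, const void *input, size_t len)
{
        u64 status;

        kernel_fpu_begin();                     /* preserve the task's FPU/XMM state */
        load_input_into_xmm(input, len);        /* copy cost grows with len */
        status = issue_fast_hypercall(control); /* nothing for the hypervisor to map */
        kernel_fpu_end();

        return status;
}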

So I think if there's no conclusive win on a small machine there's no
reason to expect it to be on a big one.

Thanks,
Roman.