
[RFC,V2,0/9] x86/mmu: Introduce parallel memory virtualization to boost performance

Message ID cover.1598868203.git.yulei.kernel@gmail.com (mailing list archive)
Series x86/mmu: Introduce parallel memory virtualization to boost performance

Message

yulei zhang Sept. 1, 2020, 11:52 a.m. UTC
From: Yulei Zhang <yulei.kernel@gmail.com>

Currently KVM memory virtualization relies on mmu_lock to
synchronize memory mapping updates, which forces the vCPUs to work
serially and slows down execution. This is especially visible after
live migration, when rebuilding a large number of memory mappings
causes a noticeable performance drop, and it gets worse as the guest
has more vCPUs and memory.
  
The idea we present in this patch set is to mitigate the issue with
a pre-constructed memory mapping table. We pin the guest memory up
front and build a global memory mapping table according to the guest
memslot changes, then apply it to CR3, so that once the guest starts
up all the vCPUs can update memory simultaneously without taking
page fault exceptions; this is where the expected performance
improvement comes from.
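
To make the mechanism concrete, a rough user-space model of the
pre-population walk is sketched below. It is an illustration only, not
the code in this series: struct memslot, pin_and_translate() and
ept_install() are invented stand-ins for the real memslot structures,
gfn-to-pfn pinning and SPTE installation.

/*
 * Illustration only -- walk each memslot once, pin/translate every
 * gfn, and install the mapping before any vCPU runs, preferring 2M
 * entries when the range is aligned, so no vCPU later takes a page
 * fault (or mmu_lock) on first touch.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_2M 512u

struct memslot {
	uint64_t base_gfn;
	uint64_t npages;
};

/* Stand-in for pinning the backing page and returning its host pfn. */
static uint64_t pin_and_translate(uint64_t gfn)
{
	return gfn + 0x100000;	/* pretend host pfn */
}

/* Stand-in for writing an EPT entry at the given level (0=4K, 1=2M). */
static void ept_install(uint64_t gfn, uint64_t pfn, int level)
{
	printf("map gfn %#llx -> pfn %#llx at level %d\n",
	       (unsigned long long)gfn, (unsigned long long)pfn, level);
}

static void prepopulate_slot(const struct memslot *slot)
{
	uint64_t gfn = slot->base_gfn;
	uint64_t end = slot->base_gfn + slot->npages;

	while (gfn < end) {
		uint64_t pfn = pin_and_translate(gfn);

		if (!(gfn % PAGES_PER_2M) && !(pfn % PAGES_PER_2M) &&
		    end - gfn >= PAGES_PER_2M) {
			ept_install(gfn, pfn, 1);	/* one 2M mapping */
			gfn += PAGES_PER_2M;
		} else {
			ept_install(gfn, pfn, 0);	/* fall back to 4K */
			gfn++;
		}
	}
}

int main(void)
{
	struct memslot slot = { .base_gfn = 0, .npages = 4 * PAGES_PER_2M };

	prepopulate_slot(&slot);  /* done once, when the memslot is set up */
	return 0;
}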

We used a memory dirty pattern workload to test the initial patch
set and got positive results even with huge pages enabled. For
example, with a guest configured with 32 vCPUs and 64G of memory,
and all vCPUs dirtying the entire memory region concurrently, the
patch set removes the mmu_lock overhead and finishes the job about
50% faster in 2M/1G huge page mode.
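
For reference, the dirty pattern test can be thought of as the
following user-space analogue run inside the guest; this is only a
sketch with assumed sizes and thread counts, not the actual test
harness behind the numbers above.

/* N threads each touch every page of their slice of a large buffer. */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define NR_THREADS  32
#define TOTAL_SIZE  (4UL << 30)		/* scale up to guest RAM, e.g. 64G */
#define PAGE_STRIDE 4096UL

static char *buf;

static void *dirty_worker(void *arg)
{
	unsigned long id = (unsigned long)arg;
	unsigned long chunk = TOTAL_SIZE / NR_THREADS;
	char *p = buf + id * chunk;

	for (unsigned long off = 0; off < chunk; off += PAGE_STRIDE)
		p[off] = (char)id;	/* dirty every page once */
	return NULL;
}

int main(void)
{
	pthread_t th[NR_THREADS];
	struct timespec t0, t1;

	buf = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned long i = 0; i < NR_THREADS; i++)
		pthread_create(&th[i], NULL, dirty_worker, (void *)i);
	for (unsigned long i = 0; i < NR_THREADS; i++)
		pthread_join(th[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("dirtied %lu GiB in %.2f s\n", TOTAL_SIZE >> 30,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}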

So far we have only validated this feature on the Intel x86
platform. As Ben pointed out in RFC V1, SMM is currently disabled
for resource considerations, and the MMU notifiers are dropped
because the memory is pinned in this mode.

V1->V2:
* Rebase the code to kernel version 5.9.0-rc1.

Yulei Zhang (9):
  Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
    support
  Introduce page table population function for direct build EPT feature
  Introduce page table remove function for direct build EPT feature
  Add release function for direct build ept when guest VM exit
  Modify the page fault path to meet the direct build EPT requirement
  Apply the direct build EPT according to the memory slots change
  Add migration support when using direct build EPT
  Introduce kvm module parameter global_tdp to turn on the direct build
    EPT mode
  Handle certain mmu exposed functions properly while turn on direct
    build EPT mode

 arch/mips/kvm/mips.c            |  13 +
 arch/powerpc/kvm/powerpc.c      |  13 +
 arch/s390/kvm/kvm-s390.c        |  13 +
 arch/x86/include/asm/kvm_host.h |  13 +-
 arch/x86/kvm/mmu/mmu.c          | 533 ++++++++++++++++++++++++++++++--
 arch/x86/kvm/svm/svm.c          |   2 +-
 arch/x86/kvm/vmx/vmx.c          |   7 +-
 arch/x86/kvm/x86.c              |  55 ++--
 include/linux/kvm_host.h        |   7 +-
 virt/kvm/kvm_main.c             |  43 ++-
 10 files changed, 639 insertions(+), 60 deletions(-)

Comments

Wanpeng Li Sept. 9, 2020, 3:04 a.m. UTC | #1
Any comments? guys!
On Tue, 1 Sep 2020 at 19:52, <yulei.kernel@gmail.com> wrote:
>
> From: Yulei Zhang <yulei.kernel@gmail.com>
>
> [...]
>
Wanpeng Li Sept. 24, 2020, 6:28 a.m. UTC | #2
Any comments? Paolo! :)
On Wed, 9 Sep 2020 at 11:04, Wanpeng Li <kernellwp@gmail.com> wrote:
>
> Any comments? guys!
> On Tue, 1 Sep 2020 at 19:52, <yulei.kernel@gmail.com> wrote:
> >
> > From: Yulei Zhang <yulei.kernel@gmail.com>
> >
> > [...]
> >
Ben Gardon Sept. 24, 2020, 5:14 p.m. UTC | #3
On Wed, Sep 23, 2020 at 11:28 PM Wanpeng Li <kernellwp@gmail.com> wrote:
>
> Any comments? Paolo! :)

Hi, sorry to be so late in replying! I wanted to post the first part
of the TDP MMU series I've been working on before responding so we
could discuss the two together, but I haven't been able to get it out
as fast as I would have liked. (I'll send it ASAP!) I'm hopeful that
it will ultimately help address some of the page fault handling and
lock contention issues you're addressing with these patches. I'd also
be happy to work together to add a prepopulation feature to it. I'll
put in some more comments inline below.

> On Wed, 9 Sep 2020 at 11:04, Wanpeng Li <kernellwp@gmail.com> wrote:
> >
> > Any comments? guys!
> > On Tue, 1 Sep 2020 at 19:52, <yulei.kernel@gmail.com> wrote:
> > >
> > > From: Yulei Zhang <yulei.kernel@gmail.com>
> > >
> > > Currently KVM memory virtualization relies on mmu_lock to
> > > synchronize memory mapping updates, which forces the vCPUs to work
> > > serially and slows down execution. This is especially visible after
> > > live migration, when rebuilding a large number of memory mappings
> > > causes a noticeable performance drop, and it gets worse as the guest
> > > has more vCPUs and memory.
> > >
> > > The idea we present in this patch set is to mitigate the issue with
> > > a pre-constructed memory mapping table. We pin the guest memory up
> > > front and build a global memory mapping table according to the guest
> > > memslot changes, then apply it to CR3, so that once the guest starts
> > > up all the vCPUs can update memory simultaneously without taking
> > > page fault exceptions; this is where the expected performance
> > > improvement comes from.

My understanding from this RFC is that your primary goal is to
eliminate page fault latencies and lock contention arising from the
first page faults incurred by vCPUs when initially populating the EPT.
Is that right?

I have the impression that the pinning and generally static memory
mappings are more a convenient simplification than part of a larger
goal to avoid incurring page faults down the line. Is that correct?

I ask because I didn't fully understand, from our conversation on v1
of this RFC, why reimplementing the page fault handler and associated
functions was necessary for the above goals, as I understood them.
My impression of the prepopulation approach is that KVM will
sequentially populate all the EPT entries to map guest memory. I
understand how this could be optimized to be quite efficient, but I
don't understand how it would scale better than the existing
implementation with one vCPU accessing memory.

> > >
> > > We used a memory dirty pattern workload to test the initial patch
> > > set and got positive results even with huge pages enabled. For
> > > example, with a guest configured with 32 vCPUs and 64G of memory,
> > > and all vCPUs dirtying the entire memory region concurrently, the
> > > patch set removes the mmu_lock overhead and finishes the job about
> > > 50% faster in 2M/1G huge page mode.

In this benchmark did you include the time required to pre-populate
the EPT or just the time required for the vCPUs to dirty memory?
I ask because I'm curious if your priority is to decrease the total
end-to-end time, or you just care about the guest experience, and not
so much the VM startup time.
How does this compare to the case where 1 vCPU reads every page of
memory and then 32 vCPUs concurrently dirty every page?

> > >
> > > So far we have only validated this feature on the Intel x86
> > > platform. As Ben pointed out in RFC V1, SMM is currently disabled
> > > for resource considerations, and the MMU notifiers are dropped
> > > because the memory is pinned in this mode.

I'm excited to see big MMU changes like this, and I look forward to
combining our needs towards a better MMU for the x86 TDP case. Have
you thought about how you would build SMM and MMU notifier support
onto this patch series? I know that the invalidate range notifiers, at
least, added a lot of non-trivial complexity to the direct MMU
implementation I presented last year.

yulei zhang Sept. 25, 2020, 12:04 p.m. UTC | #4
On Fri, Sep 25, 2020 at 1:14 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Wed, Sep 23, 2020 at 11:28 PM Wanpeng Li <kernellwp@gmail.com> wrote:
> >
> > Any comments? Paolo! :)
>
> Hi, sorry to be so late in replying! I wanted to post the first part
> of the TDP MMU series I've been working on before responding so we
> could discuss the two together, but I haven't been able to get it out
> as fast as I would have liked. (I'll send it ASAP!) I'm hopeful that
> it will ultimately help address some of the page fault handling and
> lock contention issues you're addressing with these patches. I'd also
> be happy to work together to add a prepopulation feature to it. I'll
> put in some more comments inline below.
>

Thanks for the feedback and looking forward to your patchset.

> > > > [...]
>
> My understanding from this RFC is that your primary goal is to
> eliminate page fault latencies and lock contention arising from the
> first page faults incurred by vCPUs when initially populating the EPT.
> Is that right?
>

That's right.

> I have the impression that the pinning and generally static memory
> mappings are more a convenient simplification than part of a larger
> goal to avoid incurring page faults down the line. Is that correct?
>
> I ask because I didn't fully understand, from our conversation on v1
> of this RFC, why reimplementing the page fault handler and associated
> functions was necessary for the above goals, as I understood them.
> My impression of the prepopulation approach is that, KVM will
> sequentially populate all the EPT entries to map guest memory. I
> understand how this could be optimized to be quite efficient, but I
> don't understand how it would scale better than the existing
> implementation with one vCPU accessing memory.
>

I don't think our goal is simply to eliminate the page faults. Our
target scenario is live migration: when the workload resumes on the
destination VM after migration, the vCPUs start rebuilding the
gfn-to-pfn mappings, but mmu_lock forces them to execute sequentially,
which significantly slows down the workload in the VM and hurts the
end-user experience, especially for memory-sensitive workloads.
Pre-populating the EPT entries solves the problem smoothly, as it
allows the vCPUs to execute in parallel after migration.

> > > > [...]
>
> In this benchmark did you include the time required to pre-populate
> the EPT or just the time required for the vCPUs to dirty memory?
> I ask because I'm curious if your priority is to decrease the total
> end-to-end time, or you just care about the guest experience, and not
> so much the VM startup time.

We compare the time for each vCPU thread to finish the dirtying job.
Yes, it takes some time to pre-populate the page tables, but since
each vCPU thread gains a large advantage from the concurrent dirty
writes, the result is still better even if we count that in the total
time.

> How does this compare to the case where 1 vCPU reads every page of
> memory and then 32 vCPUs concurrently dirty every page?
>

We haven't tried this yet; I think the major difference would be the
page fault latency introduced by the single vCPU's reads.

> > > > [...]
>
> I'm excited to see big MMU changes like this, and I look forward to
> combining our needs towards a better MMU for the x86 TDP case. Have
> you thought about how you would build SMM and MMU notifier support
> onto this patch series? I know that the invalidate range notifiers, at
> least, added a lot of non-trivial complexity to the direct MMU
> implementation I presented last year.
>

Thanks for the suggestion, I will think about it.

Ben Gardon Sept. 25, 2020, 5:30 p.m. UTC | #5
On Fri, Sep 25, 2020 at 5:04 AM yulei zhang <yulei.kernel@gmail.com> wrote:
>
> On Fri, Sep 25, 2020 at 1:14 AM Ben Gardon <bgardon@google.com> wrote:
> >
> > On Wed, Sep 23, 2020 at 11:28 PM Wanpeng Li <kernellwp@gmail.com> wrote:
> > >
> > > Any comments? Paolo! :)
> >
> > Hi, sorry to be so late in replying! I wanted to post the first part
> > of the TDP MMU series I've been working on before responding so we
> > could discuss the two together, but I haven't been able to get it out
> > as fast as I would have liked. (I'll send it ASAP!) I'm hopeful that
> > it will ultimately help address some of the page fault handling and
> > lock contention issues you're addressing with these patches. I'd also
> > be happy to work together to add a prepopulation feature to it. I'll
> > put in some more comments inline below.
> >
>
> Thanks for the feedback and looking forward to your patchset.
>
> > > > > [...]
> >
> > My understanding from this RFC is that your primary goal is to
> > eliminate page fault latencies and lock contention arising from the
> > first page faults incurred by vCPUs when initially populating the EPT.
> > Is that right?
> >
>
> That's right.
>
> > I have the impression that the pinning and generally static memory
> > mappings are more a convenient simplification than part of a larger
> > goal to avoid incurring page faults down the line. Is that correct?
> >
> > I ask because I didn't fully understand, from our conversation on v1
> > of this RFC, why reimplementing the page fault handler and associated
> > functions was necessary for the above goals, as I understood them.
> > My impression of the prepopulation approach is that, KVM will
> > sequentially populate all the EPT entries to map guest memory. I
> > understand how this could be optimized to be quite efficient, but I
> > don't understand how it would scale better than the existing
> > implementation with one vCPU accessing memory.
> >
>
> I don't think our goal is simply to eliminate the page faults. Our
> target scenario is live migration: when the workload resumes on the
> destination VM after migration, the vCPUs start rebuilding the
> gfn-to-pfn mappings, but mmu_lock forces them to execute sequentially,
> which significantly slows down the workload in the VM and hurts the
> end-user experience, especially for memory-sensitive workloads.
> Pre-populating the EPT entries solves the problem smoothly, as it
> allows the vCPUs to execute in parallel after migration.

Oh, thank you for explaining that. I didn't realize the goal here was
to improve LM performance. I was under the impression that this was to
give VMs a better experience on startup for fast scaling or something.
In your testing with live migration how has this affected the
distribution of time between the phases of live migration? Just for
terminology (since I'm not sure how standard it is across the
industry) I think of a live migration as consisting of 3 stages:
precopy, blackout, and postcopy. In precopy we're tracking the VM's
working set via dirty logging and sending the contents of its memory
to the target host. In blackout we pause the vCPUs on the source, copy
minimal data to the target, and resume the vCPUs on the target. In
postcopy we may still have some pages that have not been copied to the
target and so request those in response to vCPU page faults via user
fault fd or some other mechanism.

Does EPT pre-population preclude the use of a postcopy phase? I would
expect that to make the blackout phase really long. Has that not been
a problem for you?

I love the idea of partial EPT pre-population during precopy if you
could still handle postcopy and just pre-populate as memory came in.

>
> > > > > [...]
> >
> > In this benchmark did you include the time required to pre-populate
> > the EPT or just the time required for the vCPUs to dirty memory?
> > I ask because I'm curious if your priority is to decrease the total
> > end-to-end time, or you just care about the guest experience, and not
> > so much the VM startup time.
>
> We compare the time for each vCPU thread to finish the dirtying job.
> Yes, it takes some time to pre-populate the page tables, but since
> each vCPU thread gains a large advantage from the concurrent dirty
> writes, the result is still better even if we count that in the total
> time.

That makes sense to me. Your implementation definitely seems more
efficient than the existing PF handling path. It's probably much
easier to parallelize as a sort of recursive population operation too.

>
> > How does this compare to the case where 1 vCPU reads every page of
> > memory and then 32 vCPUs concurrently dirty every page?
> >
>
> We haven't tried this yet; I think the major difference would be the
> page fault latency introduced by the single vCPU's reads.

I agree. The whole VM exit path adds a lot of overhead. I wonder what
kind of numbers you'd get if you cranked PTE_PREFETCH_NUM way up
though. If you set that to >= your memory size, one PF could
pre-populate the entire EPT. It's a silly approach, but it would be a
lot more efficient as an easy POC.
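
For context, PTE_PREFETCH_NUM is the constant in arch/x86/kvm/mmu/mmu.c
that bounds how many neighbouring last-level SPTEs KVM fills in around
a faulting gfn. The snippet below is a much-simplified, self-contained
model of that prefetch window (invented helpers, not the actual mmu.c
code), just to illustrate why a huge value would approximate
prepopulation:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_PREFETCH_NUM 8	/* 8 in mmu.c; the suggestion is to raise it */
#define NR_GFNS 64

static bool present[NR_GFNS];

static void install_spte(uint64_t gfn)
{
	present[gfn] = true;
	printf("installed spte for gfn %llu\n", (unsigned long long)gfn);
}

/* One fault fills the PTE_PREFETCH_NUM-aligned window around fault_gfn. */
static void fault_with_prefetch(uint64_t fault_gfn)
{
	uint64_t start = fault_gfn & ~(uint64_t)(PTE_PREFETCH_NUM - 1);

	for (uint64_t gfn = start; gfn < start + PTE_PREFETCH_NUM; gfn++)
		if (!present[gfn])
			install_spte(gfn);
}

int main(void)
{
	fault_with_prefetch(13);	/* fills gfns 8..15 on a single fault */
	return 0;
}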

>
> > > > > [...]
> >
> > I'm excited to see big MMU changes like this, and I look forward to
> > combining our needs towards a better MMU for the x86 TDP case. Have
> > you thought about how you would build SMM and MMU notifier support
> > onto this patch series? I know that the invalidate range notifiers, at
> > least, added a lot of non-trivial complexity to the direct MMU
> > implementation I presented last year.
> >
>
> Thanks for the suggestion, I will think about it.
>
Paolo Bonzini Sept. 25, 2020, 8:50 p.m. UTC | #6
On 25/09/20 19:30, Ben Gardon wrote:
> Oh, thank you for explaining that. I didn't realize the goal here was
> to improve LM performance. I was under the impression that this was to
> give VMs a better experience on startup for fast scaling or something.
> In your testing with live migration how has this affected the
> distribution of time between the phases of live migration? Just for
> terminology (since I'm not sure how standard it is across the
> industry) I think of a live migration as consisting of 3 stages:
> precopy, blackout, and postcopy. In precopy we're tracking the VM's
> working set via dirty logging and sending the contents of its memory
> to the target host. In blackout we pause the vCPUs on the source, copy
> minimal data to the target, and resume the vCPUs on the target. In
> postcopy we may still have some pages that have not been copied to the
> target and so request those in response to vCPU page faults via user
> fault fd or some other mechanism.
> 
> Does EPT pre-population preclude the use of a postcopy phase?

I think so.

As a quick recap, postcopy migration handles two kinds of
pages---they can be copied to the destination either in the background
(stuff that was dirty when userspace decided to transition to the
blackout phase) or on demand (relayed from KVM to userspace via
get_user_pages and userfaultfd).  Normally only on-demand pages would be
served through userfaultfd, while with prepopulation every missing page
would be faulted in from the kernel through userfaultfd.  In practice
this would just extend the blackout phase.

Paolo
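
For readers unfamiliar with the userfaultfd flow described above, here
is a minimal, generic sketch (not QEMU's or KVM's actual code) of how a
postcopy-style handler registers a region for missing-page faults and
resolves each fault with UFFDIO_COPY; error handling is omitted:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LEN (4UL << 20)			/* 4 MiB of "guest RAM" */

static int uffd;
static long page_size;

static void *fault_handler(void *arg)
{
	char *src = malloc(page_size);

	memset(src, 0xaa, page_size);	/* pretend this arrived over the wire */
	for (;;) {
		struct uffd_msg msg;

		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		struct uffdio_copy copy = {
			.dst = msg.arg.pagefault.address & ~(page_size - 1),
			.src = (unsigned long)src,
			.len = page_size,
		};
		ioctl(uffd, UFFDIO_COPY, &copy);  /* wakes the faulting thread */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	page_size = sysconf(_SC_PAGESIZE);
	char *ram = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)ram, .len = LEN },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	pthread_create(&t, NULL, fault_handler, NULL);

	/* First touch of an unpopulated page blocks until UFFDIO_COPY. */
	printf("first byte: %#x\n", (unsigned char)ram[0]);
	return 0;
}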

> I would
> expect that to make the blackout phase really long. Has that not been
> a problem for you?
> 
> I love the idea of partial EPT pre-population during precopy if you
> could still handle postcopy and just pre-populate as memory came in.
>
yulei zhang Sept. 28, 2020, 11:52 a.m. UTC | #7
On Sat, Sep 26, 2020 at 4:50 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 19:30, Ben Gardon wrote:
> > Oh, thank you for explaining that. I didn't realize the goal here was
> > to improve LM performance. I was under the impression that this was to
> > give VMs a better experience on startup for fast scaling or something.
> > In your testing with live migration how has this affected the
> > distribution of time between the phases of live migration? Just for
> > terminology (since I'm not sure how standard it is across the
> > industry) I think of a live migration as consisting of 3 stages:
> > precopy, blackout, and postcopy. In precopy we're tracking the VM's
> > working set via dirty logging and sending the contents of its memory
> > to the target host. In blackout we pause the vCPUs on the source, copy
> > minimal data to the target, and resume the vCPUs on the target. In
> > postcopy we may still have some pages that have not been copied to the
> > target and so request those in response to vCPU page faults via user
> > fault fd or some other mechanism.
> >
> > Does EPT pre-population preclude the use of a postcopy phase?
>
> I think so.
>
> As a quick recap, postcopy migration handles two kinds of
> pages---they can be copied to the destination either in the background
> (stuff that was dirty when userspace decided to transition to the
> blackout phase) or on demand (relayed from KVM to userspace via
> get_user_pages and userfaultfd).  Normally only on-demand pages would be
> served through userfaultfd, while with prepopulation every missing page
> would be faulted in from the kernel through userfaultfd.  In practice
> this would just extend the blackout phase.
>
> Paolo
>

Yep, you are right: the current implementation doesn't support
postcopy. Thanks for the suggestion; we will try to fill the gap with
proper EPT population during post-copy.

> > I would
> > expect that to make the blackout phase really long. Has that not been
> > a problem for you?
> >
> > I love the idea of partial EPT pre-population during precopy if you
> > could still handle postcopy and just pre-populate as memory came in.
> >
>