
[RFC,00/15] KVM: x86/mmu: Eager Page Splitting for the TDP MMU

Message ID 20211119235759.1304274-1-dmatlack@google.com (mailing list archive)

Message

David Matlack Nov. 19, 2021, 11:57 p.m. UTC
This series is a first pass at implementing Eager Page Splitting for the
TDP MMU. For context on the motivation and design of Eager Page
Splitting, please see the RFC design proposal and discussion [1].
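
For readers who haven't seen the RFC, the rough shape of the operation is:
for every huge, present SPTE covered by dirty logging, pre-build the
lower-level page table that would otherwise be built lazily via
write-protection faults, and install it atomically. A purely illustrative
sketch follows; every helper and macro in it is hypothetical and is not the
code added by this series:

/*
 * Purely illustrative pseudocode. for_each_huge_leaf_spte() (which is
 * assumed to declare "iter"), alloc_child_page_table(), make_4k_spte()
 * and install_spte_atomic() are all hypothetical.
 */
static void eagerly_split_huge_mappings(struct kvm *kvm, gfn_t start, gfn_t end)
{
	for_each_huge_leaf_spte(kvm, start, end, iter) {
		u64 *child_pt = alloc_child_page_table(kvm);
		int i;

		/*
		 * Populate all 512 lower-level entries up front, inheriting
		 * the permissions of the huge SPTE, so vCPUs never have to
		 * take write-protection faults just to rebuild the mapping
		 * at 4K granularity.
		 */
		for (i = 0; i < 512; i++)
			child_pt[i] = make_4k_spte(iter.old_spte, i);

		/* Atomically swap the huge SPTE for a pointer to child_pt. */
		install_spte_atomic(kvm, &iter, child_pt);
	}
}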

Paolo, I went ahead and added splitting in both the initially-all-set
case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
case where we are not using initially-all-set (splitting the entire
memslot when dirty logging is enabled) to give you an idea of what
both look like.
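
For reference, the two cases correspond to the two ways userspace drives
dirty logging. Here is a minimal sketch of the uapi side, with error
handling omitted and the helper function names invented purely for
illustration:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Case 1: !initially-all-set. Turning on KVM_MEM_LOG_DIRTY_PAGES on the
 * memslot is the point where the whole memslot would be eagerly split.
 */
static void enable_dirty_logging(int vm_fd,
				 struct kvm_userspace_memory_region *region)
{
	region->flags |= KVM_MEM_LOG_DIRTY_PAGES;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region);
}

/*
 * Case 2: initially-all-set. With KVM_DIRTY_LOG_INITIALLY_SET enabled,
 * pages are only write-protected (and, with this series, split) when
 * userspace clears them via KVM_CLEAR_DIRTY_LOG.
 */
static void enable_initially_all_set(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
		.args[0] = KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
			   KVM_DIRTY_LOG_INITIALLY_SET,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

static void clear_dirty_log(int vm_fd, __u32 slot, __u64 first_page,
			    __u32 num_pages, void *bitmap)
{
	struct kvm_clear_dirty_log clear = {
		.slot = slot,
		.first_page = first_page,
		.num_pages = num_pages,
		.dirty_bitmap = bitmap,
	};

	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}

In case 2 nothing is split when logging is enabled; the work is deferred to
the first CLEAR_DIRTY_LOG that covers each region.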

Note: I will be on vacation all of next week so I will not be able to
respond to reviews until Monday November 29. I thought it would be
useful to seed discussion and reviews with an early version of the code
rather than putting it off another week. But feel free to also ignore
this until I get back :)

This series compiles and passes the most basic splitting test:

$ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4

But please operate under the assumption that this code is probably
buggy.

[1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t

David Matlack (15):
  KVM: x86/mmu: Rename rmap_write_protect to kvm_vcpu_write_protect_gfn
  KVM: x86/mmu: Rename __rmap_write_protect to rmap_write_protect
  KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails
  KVM: x86/mmu: Factor out logic to atomically install a new page table
  KVM: x86/mmu: Abstract mmu caches out to a separate struct
  KVM: x86/mmu: Derive page role from parent
  KVM: x86/mmu: Pass in vcpu->arch.mmu_caches instead of vcpu
  KVM: x86/mmu: Helper method to check for large and present sptes
  KVM: x86/mmu: Move restore_acc_track_spte to spte.c
  KVM: x86/mmu: Abstract need_resched logic from
    tdp_mmu_iter_cond_resched
  KVM: x86/mmu: Refactor tdp_mmu iterators to take kvm_mmu_page root
  KVM: x86/mmu: Split large pages when dirty logging is enabled
  KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG
  KVM: x86/mmu: Add tracepoint for splitting large pages
  KVM: x86/mmu: Update page stats when splitting large pages

 arch/x86/include/asm/kvm_host.h |  22 ++-
 arch/x86/kvm/mmu/mmu.c          | 185 +++++++++++++-----
 arch/x86/kvm/mmu/mmu_internal.h |   3 +
 arch/x86/kvm/mmu/mmutrace.h     |  20 ++
 arch/x86/kvm/mmu/spte.c         |  64 +++++++
 arch/x86/kvm/mmu/spte.h         |   7 +
 arch/x86/kvm/mmu/tdp_iter.c     |   5 +-
 arch/x86/kvm/mmu/tdp_iter.h     |  10 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 322 +++++++++++++++++++++++---------
 arch/x86/kvm/mmu/tdp_mmu.h      |   5 +
 arch/x86/kvm/x86.c              |   6 +
 11 files changed, 501 insertions(+), 148 deletions(-)

Comments

Peter Xu Nov. 26, 2021, 2:13 p.m. UTC | #1
Hi, David,

On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> This series is a first pass at implementing Eager Page Splitting for the
> TDP MMU. For context on the motivation and design of Eager Page
> Splitting, please see the RFC design proposal and discussion [1].
> 
> Paolo, I went ahead and added splitting in both the intially-all-set
> case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> case where we are not using initially-all-set (splitting the entire
> memslot when dirty logging is enabled) to give you an idea of what
> both look like.
> 
> Note: I will be on vacation all of next week so I will not be able to
> respond to reviews until Monday November 29. I thought it would be
> useful to seed discussion and reviews with an early version of the code
> rather than putting it off another week. But feel free to also ignore
> this until I get back :)
> 
> This series compiles and passes the most basic splitting test:
> 
> $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> 
> But please operate under the assumption that this code is probably
> buggy.
> 
> [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t

Will there be more numbers to show in the formal patchset?  It would be
interesting to know how "First Pass Dirty Memory Time" changes compared to the
RFC numbers; I can get a feel for it, but still. :) Also, beyond how much it
speeds up dirty workloads in the guest, some general measurement of how much it
slows down KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG
(init-all-set) would be even nicer (for CLEAR, I guess the 1st and 2nd+ rounds
will have different overhead).
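
For the CLEAR side, even just timing the ioctl from userspace would already
tell us a lot. A rough sketch of what I mean, using only the standard uapi
(the helper name is made up):

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <time.h>

/* Rough sketch: time a single KVM_CLEAR_DIRTY_LOG call, in seconds. */
static double time_clear_dirty_log(int vm_fd, struct kvm_clear_dirty_log *clear)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, clear);
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9;
}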

Besides that, I'm also wondering whether we should still have a knob for it,
in case the use case is one where eagerly splitting huge pages may not help at
all.  What I'm thinking:

  - Read-mostly guest workload: splitting huge pages will speed up the rare
    writes, but in the meantime drag readers down due to huge->small page
    mappings.

  - Writes-over-a-very-limited-region workload: say we have a 1T guest and
    the app in the guest only writes a 10G part of it.  Hmm, not sure whether
    that exists..

  - Postcopy-targeted: precopy may only run a few iterations just to send the
    static pages, so the migration duration will be relatively short, and the
    writes just won't spread much across the whole guest memory.

I don't really think any of these examples is strong enough, as they're all
very much corner cases, but they show why I wanted to raise the question of
whether an unconditional eager split is the best approach.

Thanks,
David Matlack Nov. 30, 2021, 11:22 p.m. UTC | #2
On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, David,
>
> On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > This series is a first pass at implementing Eager Page Splitting for the
> > TDP MMU. For context on the motivation and design of Eager Page
> > Splitting, please see the RFC design proposal and discussion [1].
> >
> > Paolo, I went ahead and added splitting in both the intially-all-set
> > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > case where we are not using initially-all-set (splitting the entire
> > memslot when dirty logging is enabled) to give you an idea of what
> > both look like.
> >
> > Note: I will be on vacation all of next week so I will not be able to
> > respond to reviews until Monday November 29. I thought it would be
> > useful to seed discussion and reviews with an early version of the code
> > rather than putting it off another week. But feel free to also ignore
> > this until I get back :)
> >
> > This series compiles and passes the most basic splitting test:
> >
> > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> >
> > But please operate under the assumption that this code is probably
> > buggy.
> >
> > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
>
> Will there be more numbers to show in the formal patchset?

Yes definitely. I didn't have a lot of time to test this series, hence
the RFC status. I'll include more thorough testing and performance
evaluation in the cover letter for v1.


> It's interesting to
> know how "First Pass Dirty Memory Time" will change comparing to the rfc
> numbers; I can have a feel of it, but still. :) Also, not only how it speedup
> guest dirty apps, but also some general measurement on how it slows down
> KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> even nicer (for CLEAR, I guess the 1st/2nd+ round will have different overhead).
>
> Besides that, I'm also wondering whether we should still have a knob for it, as
> I'm wondering what if the use case is the kind where eager split huge page may
> not help at all.  What I'm thinking:
>
>   - Read-mostly guest overload; split huge page will speed up rare writes, but
>     at the meantime drag readers down due to huge->small page mappings.
>
>   - Writes-over-very-limited-region workload: say we have 1T guest and the app
>     in the guest only writes 10G part of it.  Hmm not sure whether it exists..
>
>   - Postcopy targeted: it means precopy may only run a few iterations just to
>     send the static pages, so the migration duration will be relatively short,
>     and the write just didn't spread a lot to the whole guest mem.
>
> I don't really think any of the example is strong enough as they're all very
> corner cased, but just to show what I meant to raise this question on whether
> unconditionally eager split is the best approach.

I'd be happy to add a knob if there's a userspace that wants to use
it. I think the main challenge though is knowing when it is safe to
disable eager splitting. For a small deployment where you know the VM
workload, it might make sense. But for a public cloud provider the
only feasible way would be to dynamically monitor the guest writing
patterns. But then we're back at square one because that would require
dirty logging. And even then, there's no guaranteed way to predict
future guest write patterns based on past patterns.

The way forward here might be to do a hybrid of 2M and 4K dirty
tracking (and maybe even 1G). For example, first start dirty logging
at 2M granularity, and then log at 4K for any specific regions or
memslots that aren't making progress. We'd still use Eager Page
Splitting unconditionally though, first to split to 2M and then to
split to 4K.
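
In case that's unclear, here is purely illustrative pseudocode of the policy
I have in mind; the dirty_granularity field and both helpers below don't
exist anywhere, they're just to sketch the idea:

/*
 * Purely illustrative; every field and helper here is hypothetical and
 * does not exist in KVM today.
 */
static void adjust_dirty_tracking_granularity(struct kvm_memory_slot *slot)
{
	/* Start out logging (and eagerly splitting) at 2M granularity. */
	if (slot->dirty_granularity == PG_LEVEL_NONE) {
		slot->dirty_granularity = PG_LEVEL_2M;
		eagerly_split_to(slot, PG_LEVEL_2M);
		return;
	}

	/*
	 * If the slot keeps re-dirtying faster than it can be transmitted,
	 * fall back to 4K granularity (splitting the rest of the way down)
	 * so the dirty bitmap is less pessimistic.
	 */
	if (slot->dirty_granularity == PG_LEVEL_2M &&
	    !slot_is_making_progress(slot)) {
		slot->dirty_granularity = PG_LEVEL_4K;
		eagerly_split_to(slot, PG_LEVEL_4K);
	}
}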

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Dec. 1, 2021, 4:10 a.m. UTC | #3
On Tue, Nov 30, 2021 at 03:22:29PM -0800, David Matlack wrote:
> On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, David,
> >
> > On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > > This series is a first pass at implementing Eager Page Splitting for the
> > > TDP MMU. For context on the motivation and design of Eager Page
> > > Splitting, please see the RFC design proposal and discussion [1].
> > >
> > > Paolo, I went ahead and added splitting in both the intially-all-set
> > > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > > case where we are not using initially-all-set (splitting the entire
> > > memslot when dirty logging is enabled) to give you an idea of what
> > > both look like.
> > >
> > > Note: I will be on vacation all of next week so I will not be able to
> > > respond to reviews until Monday November 29. I thought it would be
> > > useful to seed discussion and reviews with an early version of the code
> > > rather than putting it off another week. But feel free to also ignore
> > > this until I get back :)
> > >
> > > This series compiles and passes the most basic splitting test:
> > >
> > > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> > >
> > > But please operate under the assumption that this code is probably
> > > buggy.
> > >
> > > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
> >
> > Will there be more numbers to show in the formal patchset?
> 
> Yes definitely. I didn't have a lot of time to test this series, hence
> the RFC status. I'll include more thorough testing and performance
> evaluation in the cover letter for v1.
> 
> 
> > It's interesting to
> > know how "First Pass Dirty Memory Time" will change comparing to the rfc
> > numbers; I can have a feel of it, but still. :) Also, not only how it speedup
> > guest dirty apps, but also some general measurement on how it slows down
> > KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> > even nicer (for CLEAR, I guess the 1st/2nd+ round will have different overhead).
> >
> > Besides that, I'm also wondering whether we should still have a knob for it, as
> > I'm wondering what if the use case is the kind where eager split huge page may
> > not help at all.  What I'm thinking:
> >
> >   - Read-mostly guest overload; split huge page will speed up rare writes, but
> >     at the meantime drag readers down due to huge->small page mappings.
> >
> >   - Writes-over-very-limited-region workload: say we have 1T guest and the app
> >     in the guest only writes 10G part of it.  Hmm not sure whether it exists..
> >
> >   - Postcopy targeted: it means precopy may only run a few iterations just to
> >     send the static pages, so the migration duration will be relatively short,
> >     and the write just didn't spread a lot to the whole guest mem.
> >
> > I don't really think any of the example is strong enough as they're all very
> > corner cased, but just to show what I meant to raise this question on whether
> > unconditionally eager split is the best approach.
> 
> I'd be happy to add a knob if there's a userspace that wants to use
> it. I think the main challenge though is knowing when it is safe to
> disable eager splitting.

Isn't it a performance feature?  Why would it not be safe?

> For a small deployment where you know the VM workload, it might make
> sense. But for a public cloud provider the only feasible way would be to
> dynamically monitor the guest writing patterns. But then we're back at square
> one because that would require dirty logging. And even then, there's no
> guaranteed way to predict future guest write patterns based on past patterns.

Agreed. What I was thinking of was not public cloud usage, but cases where we
can do specific tuning for specific scenarios.  It normally won't matter much
for small or medium sized VMs, only for extreme use cases.

> 
> The way forward here might be to do a hybrid of 2M and 4K dirty
> tracking (and maybe even 1G). For example, first start dirty logging
> at 2M granularity, and then log at 4K for any specific regions or
> memslots that aren't making progress. We'd still use Eager Page
> Splitting unconditionally though, first to split to 2M and then to
> split to 4K.

Do you mean we'd also offer dirty bitmaps at different granularities to
userspace?

I remember you mentioned 2MB dirty tracking in your RFC series, but I didn't
expect it could be dynamically switched during tracking.  That sounds like a
very interesting idea.

Thanks,
Peter Xu Dec. 1, 2021, 4:19 a.m. UTC | #4
On Wed, Dec 01, 2021 at 12:10:38PM +0800, Peter Xu wrote:
> On Tue, Nov 30, 2021 at 03:22:29PM -0800, David Matlack wrote:
> > On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, David,
> > >
> > > On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > > > This series is a first pass at implementing Eager Page Splitting for the
> > > > TDP MMU. For context on the motivation and design of Eager Page
> > > > Splitting, please see the RFC design proposal and discussion [1].
> > > >
> > > > Paolo, I went ahead and added splitting in both the intially-all-set
> > > > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > > > case where we are not using initially-all-set (splitting the entire
> > > > memslot when dirty logging is enabled) to give you an idea of what
> > > > both look like.
> > > >
> > > > Note: I will be on vacation all of next week so I will not be able to
> > > > respond to reviews until Monday November 29. I thought it would be
> > > > useful to seed discussion and reviews with an early version of the code
> > > > rather than putting it off another week. But feel free to also ignore
> > > > this until I get back :)
> > > >
> > > > This series compiles and passes the most basic splitting test:
> > > >
> > > > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> > > >
> > > > But please operate under the assumption that this code is probably
> > > > buggy.
> > > >
> > > > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
> > >
> > > Will there be more numbers to show in the formal patchset?
> > 
> > Yes definitely. I didn't have a lot of time to test this series, hence
> > the RFC status. I'll include more thorough testing and performance
> > evaluation in the cover letter for v1.
> > 
> > 
> > > It's interesting to
> > > know how "First Pass Dirty Memory Time" will change comparing to the rfc
> > > numbers; I can have a feel of it, but still. :) Also, not only how it speedup
> > > guest dirty apps, but also some general measurement on how it slows down
> > > KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> > > even nicer (for CLEAR, I guess the 1st/2nd+ round will have different overhead).
> > >
> > > Besides that, I'm also wondering whether we should still have a knob for it, as
> > > I'm wondering what if the use case is the kind where eager split huge page may
> > > not help at all.  What I'm thinking:
> > >
> > >   - Read-mostly guest overload; split huge page will speed up rare writes, but
> > >     at the meantime drag readers down due to huge->small page mappings.
> > >
> > >   - Writes-over-very-limited-region workload: say we have 1T guest and the app
> > >     in the guest only writes 10G part of it.  Hmm not sure whether it exists..
> > >
> > >   - Postcopy targeted: it means precopy may only run a few iterations just to
> > >     send the static pages, so the migration duration will be relatively short,
> > >     and the write just didn't spread a lot to the whole guest mem.
> > >
> > > I don't really think any of the example is strong enough as they're all very
> > > corner cased, but just to show what I meant to raise this question on whether
> > > unconditionally eager split is the best approach.
> > 
> > I'd be happy to add a knob if there's a userspace that wants to use
> > it. I think the main challenge though is knowing when it is safe to
> > disable eager splitting.
> 
> Isn't it a performance feature?  Why it'll be not safe?
> 
> > For a small deployment where you know the VM workload, it might make
> > sense. But for a public cloud provider the only feasible way would be to
> > dynamically monitor the guest writing patterns. But then we're back at square
> > one because that would require dirty logging. And even then, there's no
> > guaranteed way to predict future guest write patterns based on past patterns.
> 
> Agreed, what I was thinking was not for public cloud usages, but for the cases
> where we can do specific tunings on some specific scenarios.  It normally won't
> matter a lot with small or medium sized VMs but extreme use cases.

PS: I think even with a tunable, a single static module parameter should be
more than enough for anything I can imagine for now.

> 
> > 
> > The way forward here might be to do a hybrid of 2M and 4K dirty
> > tracking (and maybe even 1G). For example, first start dirty logging
> > at 2M granularity, and then log at 4K for any specific regions or
> > memslots that aren't making progress. We'd still use Eager Page
> > Splitting unconditionally though, first to split to 2M and then to
> > split to 4K.
> 
> Do you mean we'd also offer different granule dirty bitmap to the userspace
> too?
> 
> I remembered you mentioned 2mb dirty tracking in your rfc series, but I didn't
> expect it can be dynamically switched during tracking.  That sounds a very
> intersting idea.
> 
> Thanks,
> 
> -- 
> Peter Xu
David Matlack Dec. 1, 2021, 9:46 p.m. UTC | #5
On Tue, Nov 30, 2021 at 8:10 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Nov 30, 2021 at 03:22:29PM -0800, David Matlack wrote:
> > On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, David,
> > >
> > > On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > > > This series is a first pass at implementing Eager Page Splitting for the
> > > > TDP MMU. For context on the motivation and design of Eager Page
> > > > Splitting, please see the RFC design proposal and discussion [1].
> > > >
> > > > Paolo, I went ahead and added splitting in both the intially-all-set
> > > > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > > > case where we are not using initially-all-set (splitting the entire
> > > > memslot when dirty logging is enabled) to give you an idea of what
> > > > both look like.
> > > >
> > > > Note: I will be on vacation all of next week so I will not be able to
> > > > respond to reviews until Monday November 29. I thought it would be
> > > > useful to seed discussion and reviews with an early version of the code
> > > > rather than putting it off another week. But feel free to also ignore
> > > > this until I get back :)
> > > >
> > > > This series compiles and passes the most basic splitting test:
> > > >
> > > > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> > > >
> > > > But please operate under the assumption that this code is probably
> > > > buggy.
> > > >
> > > > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@mail.gmail.com/#t
> > >
> > > Will there be more numbers to show in the formal patchset?
> >
> > Yes definitely. I didn't have a lot of time to test this series, hence
> > the RFC status. I'll include more thorough testing and performance
> > evaluation in the cover letter for v1.
> >
> >
> > > It's interesting to
> > > know how "First Pass Dirty Memory Time" will change comparing to the rfc
> > > numbers; I can have a feel of it, but still. :) Also, not only how it speedup
> > > guest dirty apps, but also some general measurement on how it slows down
> > > KVM_SET_USER_MEMORY_REGION (!init-all-set) or CLEAR_LOG (init-all-set) would be
> > > even nicer (for CLEAR, I guess the 1st/2nd+ round will have different overhead).
> > >
> > > Besides that, I'm also wondering whether we should still have a knob for it, as
> > > I'm wondering what if the use case is the kind where eager split huge page may
> > > not help at all.  What I'm thinking:
> > >
> > >   - Read-mostly guest overload; split huge page will speed up rare writes, but
> > >     at the meantime drag readers down due to huge->small page mappings.
> > >
> > >   - Writes-over-very-limited-region workload: say we have 1T guest and the app
> > >     in the guest only writes 10G part of it.  Hmm not sure whether it exists..
> > >
> > >   - Postcopy targeted: it means precopy may only run a few iterations just to
> > >     send the static pages, so the migration duration will be relatively short,
> > >     and the write just didn't spread a lot to the whole guest mem.
> > >
> > > I don't really think any of the example is strong enough as they're all very
> > > corner cased, but just to show what I meant to raise this question on whether
> > > unconditionally eager split is the best approach.
> >
> > I'd be happy to add a knob if there's a userspace that wants to use
> > it. I think the main challenge though is knowing when it is safe to
> > disable eager splitting.
>
> Isn't it a performance feature?  Why it'll be not safe?

Heh, "safe" is a bit overzealous. But we've found that as the vCPU
count scales in VMs, not doing Eager Page Splitting leads to
unacceptable performance degradations (per customers), especially when
using the shadow MMU where hugepage write-protection faults are done
while holding the MMU lock in write mode. So from that perspective,
it's "unsafe" to skip Eager Page Splitting unless you are absolutely
sure the guest workload will not be doing many writes.

>
> > For a small deployment where you know the VM workload, it might make
> > sense. But for a public cloud provider the only feasible way would be to
> > dynamically monitor the guest writing patterns. But then we're back at square
> > one because that would require dirty logging. And even then, there's no
> > guaranteed way to predict future guest write patterns based on past patterns.
>
> Agreed, what I was thinking was not for public cloud usages, but for the cases
> where we can do specific tunings on some specific scenarios.  It normally won't
> matter a lot with small or medium sized VMs but extreme use cases.

Ack. I'll include a module parameter in v1, like you suggested in your other email.
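
Something along these lines, presumably (the parameter name, default, and
placement are just placeholders for now, not what v1 will necessarily use):

/* Hypothetical sketch of the knob; final name and default are TBD. */
static bool __read_mostly eagerly_split_huge_pages = true;
module_param(eagerly_split_huge_pages, bool, 0644);

/* Hypothetical call site guarding the split path. */
static void try_eager_page_split(struct kvm *kvm,
				 const struct kvm_memory_slot *slot)
{
	if (!eagerly_split_huge_pages)
		return;

	/* ... walk the slot and split huge SPTEs ... */
}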

>
> >
> > The way forward here might be to do a hybrid of 2M and 4K dirty
> > tracking (and maybe even 1G). For example, first start dirty logging
> > at 2M granularity, and then log at 4K for any specific regions or
> > memslots that aren't making progress. We'd still use Eager Page
> > Splitting unconditionally though, first to split to 2M and then to
> > split to 4K.
>
> Do you mean we'd also offer different granule dirty bitmap to the userspace
> too?

Perhaps. The 2M dirty tracking work is still in very early research
phases and the first version will likely not be so dynamic. But I
could imagine we eventually get to the point where we are doing some
hybrid approach.

>
> I remembered you mentioned 2mb dirty tracking in your rfc series, but I didn't
> expect it can be dynamically switched during tracking.  That sounds a very
> intersting idea.
>
> Thanks,

Thanks for all the reviews and feedback on this series!

>
> --
> Peter Xu
>