
[RFC,13/15] KVM: x86/mmu: Split large pages during CLEAR_DIRTY_LOG

Message ID: 20211119235759.1304274-14-dmatlack@google.com
State: New, archived
Series: KVM: x86/mmu: Eager Page Splitting for the TDP MMU

Commit Message

David Matlack Nov. 19, 2021, 11:57 p.m. UTC
When using initially-all-set, large pages are not write-protected when
dirty logging is enabled on the memslot. Instead they are
write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
time, and only for the specific sub-region of the memslot that userspace
wishes to clear.

Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of large page
splitting across multiple ioctls rather than splitting the entire
memslot when not using initially-all-set.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
 2 files changed, 26 insertions(+), 8 deletions(-)
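
As an illustration of spreading the splitting cost across multiple ioctls, below is a
minimal userspace sketch (not part of this patch; the vm_fd, slot number, region size,
and chunk size are all assumptions) that clears the dirty log in fixed-size chunks, so
each KVM_CLEAR_DIRTY_LOG call only splits and write-protects its own sub-range:

/*
 * Sketch only: assumes vm_fd is a VM with dirty logging already enabled on
 * slot 0 and KVM_DIRTY_LOG_INITIALLY_SET in use, and that npages is a
 * multiple of CHUNK_PAGES.
 */
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

#define CHUNK_PAGES     ((256ull << 20) >> 12)  /* 256 MiB worth of 4 KiB pages */

static int clear_dirty_in_chunks(int vm_fd, __u64 first_page, __u64 npages)
{
        /* All-ones bitmap: ask KVM to clear (and write-protect) the whole chunk. */
        static __u8 bitmap[CHUNK_PAGES / 8];
        struct kvm_clear_dirty_log clear;
        __u64 off;

        memset(bitmap, 0xff, sizeof(bitmap));

        for (off = 0; off < npages; off += CHUNK_PAGES) {
                clear = (struct kvm_clear_dirty_log) {
                        .slot = 0,
                        .first_page = first_page + off,
                        .num_pages = CHUNK_PAGES,
                        .dirty_bitmap = bitmap,
                };
                /* With this patch, each call eagerly splits only this sub-range. */
                if (ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear) < 0)
                        return -1;
        }
        return 0;
}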

Comments

Peter Xu Nov. 26, 2021, 12:17 p.m. UTC | #1
On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6768ef9c0891..4e78ef2dd352 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
>  		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
>  
> +		/*
> +		 * Try to proactively split any large pages down to 4KB so that
> +		 * vCPUs don't have to take write-protection faults.
> +		 */
> +		kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> +
>  		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>  
>  		/* Cross two large pages? */

Is it intended to try split every time even if we could have split it already?
As I remember Paolo mentioned that we can skip split if it's not the 1st
CLEAR_LOG on the same range, and IIUC that makes sense.

But indeed I don't see a trivial way to know whether this is the first clear of
this range.  Maybe we can maintain "how many huge pages are there under current
kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
skip it.  Just a wild idea..
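
A rough sketch of that per-root counter idea (the struct and helpers below are
hypothetical, not existing KVM code; hooking them into the SPTE install and
split/zap paths is left out):

#include <linux/atomic.h>

/* In real code this would be a new field in struct kvm_mmu_page. */
struct root_huge_stats {
        atomic64_t nr_huge_sptes;       /* huge SPTEs still mapped under this root */
};

/* Call with +1 when a huge SPTE is installed, -1 when it is split or zapped. */
static void account_huge_spte(struct root_huge_stats *root, long delta)
{
        atomic64_add(delta, &root->nr_huge_sptes);
}

/* CLEAR_LOG could skip the split walk when this returns false for every
 * root covering the range. */
static bool root_may_have_huge_pages(struct root_huge_stats *root)
{
        return atomic64_read(&root->nr_huge_sptes) != 0;
}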

Or maybe it's intended to try split unconditionally for some reason?  If so
it'll be great to mention that either in the commit message or in comments.

Thanks,
David Matlack Dec. 1, 2021, 12:16 a.m. UTC | #2
On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > +             /*
> > +              * Try to proactively split any large pages down to 4KB so that
> > +              * vCPUs don't have to take write-protection faults.
> > +              */
> > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > +
> >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> >               /* Cross two large pages? */
>
> Is it intended to try split every time even if we could have split it already?
> As I remember Paolo mentioned that we can skip split if it's not the 1st
> CLEAR_LOG on the same range, and IIUC that makes sense.
>
> But indeed I don't see a trivial way to know whether this is the first clear of
> this range.  Maybe we can maintain "how many huge pages are there under current
> kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
> skip it.  Just a wild idea..
>
> Or maybe it's intended to try split unconditionally for some reason?  If so
> it'll be great to mention that either in the commit message or in comments.

Thanks for calling this out. Could the same be said about the existing
code that unconditionally tries to write-protect 2M+ pages? I aimed to
keep parity with the write-protection calls (always try to split
before write-protecting) but I agree there might be opportunities
available to skip altogether.

By the way, looking at this code again I think I see some potential bugs:
 - I don't think I ever free split_caches in the initially-all-set case.
 - What happens if splitting fails the CLEAR_LOG but succeeds the
CLEAR_LOG? We would end up propagating the write-protection on the 2M
page down to the 4K page. This might cause issues if using PML.

>
> Thanks,
>
> --
> Peter Xu
>
David Matlack Dec. 1, 2021, 12:17 a.m. UTC | #3
On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6768ef9c0891..4e78ef2dd352 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > >
> > > +             /*
> > > +              * Try to proactively split any large pages down to 4KB so that
> > > +              * vCPUs don't have to take write-protection faults.
> > > +              */
> > > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > +
> > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > >
> > >               /* Cross two large pages? */
> >
> > Is it intended to try split every time even if we could have split it already?
> > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > CLEAR_LOG on the same range, and IIUC that makes sense.
> >
> > But indeed I don't see a trivial way to know whether this is the first clear of
> > this range.  Maybe we can maintain "how many huge pages are there under current
> > kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
> > skip it.  Just a wild idea..
> >
> > Or maybe it's intended to try split unconditionally for some reason?  If so
> > it'll be great to mention that either in the commit message or in comments.
>
> Thanks for calling this out. Could the same be said about the existing
> code that unconditionally tries to write-protect 2M+ pages? I aimed to
> keep parity with the write-protection calls (always try to split
> before write-protecting) but I agree there might be opportunities
> available to skip altogether.
>
> By the way, looking at this code again I think I see some potential bugs:
>  - I don't think I ever free split_caches in the initially-all-set case.
>  - What happens if splitting fails the CLEAR_LOG but succeeds the
> CLEAR_LOG?

Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.

> We would end up propagating the write-protection on the 2M
> page down to the 4K page. This might cause issues if using PML.
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
Peter Xu Dec. 1, 2021, 4:03 a.m. UTC | #4
On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > >
> > > > +             /*
> > > > +              * Try to proactively split any large pages down to 4KB so that
> > > > +              * vCPUs don't have to take write-protection faults.
> > > > +              */
> > > > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > +
> > > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > >
> > > >               /* Cross two large pages? */
> > >
> > > Is it intended to try split every time even if we could have split it already?
> > > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > >
> > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > this range.  Maybe we can maintain "how many huge pages are there under current
> > > kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
> > > skip it.  Just a wild idea..
> > >
> > > Or maybe it's intended to try split unconditionally for some reason?  If so
> > > it'll be great to mention that either in the commit message or in comments.
> >
> > Thanks for calling this out. Could the same be said about the existing
> > code that unconditionally tries to write-protect 2M+ pages?

They're different because wr-protect can be restored (to be not-wr-protected)
when vcpu threads write to the pages, so they need to be always done.

For huge page split - when it happened during dirty tracking it'll not be
recovered anymore, so it's a one-time thing.

> > I aimed to keep parity with the write-protection calls (always try to split
> > before write-protecting) but I agree there might be opportunities available
> > to skip altogether.

So IMHO it's not about parity but it could be about how easy can it be
implemented, and whether it'll be worth it to add that complexity.

Besides the above accounting idea per-sp, we can have other ways to do this
too, e.g., keeping a bitmap showing which range has been split: that bitmap
will be 2M in granule for x86 because that'll be enough.  We init-all-ones for
the bitmap when start logging for a memslot.
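
A rough sketch of that bitmap idea (all names below are hypothetical, not code
from this series):

#include <linux/bitops.h>
#include <linux/types.h>

#define GFNS_PER_2M     512ul           /* 4 KiB GFNs per 2 MiB region */

struct split_pending {
        unsigned long *bits;            /* bitmap_fill() when logging starts */
        unsigned long nr_regions;       /* memslot->npages / GFNS_PER_2M */
};

/* Only regions whose bit is still set may contain unsplit huge pages. */
static bool region_needs_split(struct split_pending *sp, u64 gfn_offset)
{
        return test_bit(gfn_offset / GFNS_PER_2M, sp->bits);
}

/* Clear the bit once the 2 MiB region has been eagerly split. */
static void mark_region_split(struct split_pending *sp, u64 gfn_offset)
{
        clear_bit(gfn_offset / GFNS_PER_2M, sp->bits);
}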

But again maybe it turns out we don't really want that complexity.

IMHO a good start could be the perf numbers (which I asked in the cover letter)
comparing the overhead of 2nd+ iterations of CLEAR_LOG with/without eager page
split.

> >
> > By the way, looking at this code again I think I see some potential bugs:
> >  - I don't think I ever free split_caches in the initially-all-set case.

I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

> >  - What happens if splitting fails the CLEAR_LOG but succeeds the
> > CLEAR_LOG?
> 
> Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
> 
> > We would end up propagating the write-protection on the 2M
> > page down to the 4K page. This might cause issues if using PML.

Hmm looks correct.. I'm wondering what will happen with that.

Firstly this should be rare as the 1st split should in 99% cases succeed.

Then if split failed at the 1st attempt, we wr-protected sptes even during pml
during the split.  When written, we'll go the fast page fault and record the
writes too, I think, as we'll apply dirty bit to the new spte so I think it'll
just skip pml.  Looks like we'll be using a mixture of pml+wp but all dirty
will still be captured as expected?..

There could be leftover wp when stopping dirty logging, but that seems not
directly harmful too.  It'll make things a bit messed up, at least.
Sean Christopherson Dec. 1, 2021, 7:22 p.m. UTC | #5
On Fri, Nov 19, 2021, David Matlack wrote:
> When using initially-all-set, large pages are not write-protected when
> dirty logging is enabled on the memslot. Instead they are
> write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
> time, and only for the specific sub-region of the memslot that userspace
> wishes to clear.
> 
> Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> write-protecting to avoid causing write-protection faults on vCPU
> threads. This also allows userspace to smear the cost of large page
> splitting across multiple ioctls rather than splitting the entire
> memslot when not using initially-all-set.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 ++++
>  arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
>  2 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 432a4df817ec..6b5bf99f57af 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  				      const struct kvm_memory_slot *memslot,
>  				      int start_level);
> +void kvm_mmu_try_split_large_pages(struct kvm *kvm,

I would prefer we use hugepage when possible, mostly because that's the terminology
used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
hugepage when adding new code, hopefully someday we'll have enough inertia to commit
fully to hugepage.

> +				   const struct kvm_memory_slot *memslot,
> +				   u64 start, u64 end,
> +				   int target_level);
>  void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
>  					const struct kvm_memory_slot *memslot,
>  					int target_level);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6768ef9c0891..4e78ef2dd352 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>  		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
>  		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
>  
> +		/*
> +		 * Try to proactively split any large pages down to 4KB so that
> +		 * vCPUs don't have to take write-protection faults.
> +		 */
> +		kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);

This should return a value.  If splitting succeeds, there should be no hugepages
and so walking the page tables to write-protect 2M is unnecessary.  Same for the
previous patch, although skipping the write-protect path is a little less
straightforward in that case.

> +
>  		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
>  
>  		/* Cross two large pages? */
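A sketch of that suggestion (the bool return and its exact semantics are
assumptions, not code from this series):

/* Returns true if the whole start..end range was split to 4K, i.e. no huge
 * pages remain. */
bool kvm_mmu_try_split_large_pages(struct kvm *kvm,
                                   const struct kvm_memory_slot *memslot,
                                   u64 start, u64 end, int target_level);

/* Caller side in kvm_arch_mmu_enable_log_dirty_pt_masked(): only walk the
 * page tables to write-protect at 2M if some huge pages could remain. */
if (!kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K))
        kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);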
Ben Gardon Dec. 1, 2021, 7:49 p.m. UTC | #6
On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > When using initially-all-set, large pages are not write-protected when
> > dirty logging is enabled on the memslot. Instead they are
> > write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
> > time, and only for the specific sub-region of the memslot that userspace
> > wishes to clear.
> >
> > Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> > write-protecting to avoid causing write-protection faults on vCPU
> > threads. This also allows userspace to smear the cost of large page
> > splitting across multiple ioctls rather than splitting the entire
> > memslot when not using initially-all-set.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  4 ++++
> >  arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
> >  2 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 432a4df817ec..6b5bf99f57af 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                                     const struct kvm_memory_slot *memslot,
> >                                     int start_level);
> > +void kvm_mmu_try_split_large_pages(struct kvm *kvm,
>
> I would prefer we use hugepage when possible, mostly because that's the terminology
> used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> fully to hugepage.

In my mind "huge page" implies 2M and "large page" is generic to 2m
and 1g. (IDK if we settled on a name for 1G pages)
I've definitely been guilty of reinforcing this inconsistent
terminology. (Though it was consistent in my head, of course.) If we
want to pick one and use it everywhere, I'm happy to get onboard with
a standard terminology.

>
> > +                                const struct kvm_memory_slot *memslot,
> > +                                u64 start, u64 end,
> > +                                int target_level);
> >  void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> >                                       const struct kvm_memory_slot *memslot,
> >                                       int target_level);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > +             /*
> > +              * Try to proactively split any large pages down to 4KB so that
> > +              * vCPUs don't have to take write-protection faults.
> > +              */
> > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
>
> This should return a value.  If splitting succeeds, there should be no hugepages
> and so walking the page tables to write-protect 2M is unnecessary.  Same for the
> previous patch, although skipping the write-protect path is a little less
> straightforward in that case.
>
> > +
> >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> >               /* Cross two large pages? */
Sean Christopherson Dec. 1, 2021, 8:16 p.m. UTC | #7
On Wed, Dec 01, 2021, Ben Gardon wrote:
> On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
> > I would prefer we use hugepage when possible, mostly because that's the terminology
> > used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> > hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> > fully to hugepage.
> 
> In my mind "huge page" implies 2M and "large page" is generic to 2m
> and 1g. (IDK if we settled on a name for 1G pages)

What about 4m PSE pages?  :-)

I'm mostly joking, but it does raise the point that trying to provide unique names
for each size is a bit of a fool's errand, especially on non-x86 architectures that
support a broader variety of hugepage sizes.  IMO, the least ambiguous way to refer
to hugepages is to say that everything that isn't a 4k page (or whatever PAGE_SIZE
is on the architecture) is a hugepage, and then explicitly state the size of the
page if it matters.

> I've definitely been guilty of reinforcing this inconsistent
> terminology. (Though it was consistent in my head, of course.) If we
> want to pick one and use it everywhere, I'm happy to get onboard with
> a standard terminology.

I hear you on using "large page", I've had to undo a solid decade of "large page"
terminology from my pre-Linux days.  But for better or worse, the kernel uses
hugepage, e.g. hugetlbfs supports 1gb and 2mb pages.  I think we should follow
the kernel, especially since we have aspirations of unifying more of KVM's MMU
across multiple architectures.
Ben Gardon Dec. 1, 2021, 10:11 p.m. UTC | #8
On Wed, Dec 1, 2021 at 12:17 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Dec 01, 2021, Ben Gardon wrote:
> > On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
> > > I would prefer we use hugepage when possible, mostly because that's the terminology
> > > used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> > > hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> > > fully to hugepage.
> >
> > In my mind "huge page" implies 2M and "large page" is generic to 2m
> > and 1g. (IDK if we settled on a name for 1G pages)
>
> What about 4m PSE pages?  :-)
>
> I'm mostly joking, but it does raise the point that trying to provide unique names
> for each size is a bit of a fool's errand, especially on non-x86 architectures that
> support a broader variety of hugepage sizes.  IMO, the least ambiguous way to refer
> to hugepages is to say that everything that isn't a 4k page (or whatever PAGE_SIZE
> is on the architecture) is a hugepage, and then explicitly state the size of the
> page if it matters.
>
> > I've definitely been guilty of reinforcing this inconsistent
> > terminology. (Though it was consistent in my head, of course.) If we
> > want to pick one and use it everywhere, I'm happy to get onboard with
> > a standard terminology.
>
> I hear you on using "large page", I've had to undo a solid decade of "large page"
> terminology from my pre-Linux days.  But for better or worse, the kernel uses
> hugepage, e.g. hugetlbfs supports 1gb and 2mb pages.  I think we should follow
> the kernel, especially since we have aspirations of unifying more of KVM's MMU
> across multiple architectures.

Sounds good to me. I'll keep that in mind in future patches. I'm happy
to call them anything as long as we all use the same terms.
David Matlack Dec. 1, 2021, 10:14 p.m. UTC | #9
On Tue, Nov 30, 2021 at 8:04 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> > On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > > >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > > >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > > >
> > > > > +             /*
> > > > > +              * Try to proactively split any large pages down to 4KB so that
> > > > > +              * vCPUs don't have to take write-protection faults.
> > > > > +              */
> > > > > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > > +
> > > > >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > > >
> > > > >               /* Cross two large pages? */
> > > >
> > > > Is it intended to try split every time even if we could have split it already?
> > > > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > > >
> > > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > > this range.  Maybe we can maintain "how many huge pages are there under current
> > > > kvm_mmu_page node" somehow?  Then if root sp has the counter==0, then we can
> > > > skip it.  Just a wild idea..
> > > >
> > > > Or maybe it's intended to try split unconditionally for some reason?  If so
> > > > it'll be great to mention that either in the commit message or in comments.
> > >
> > > Thanks for calling this out. Could the same be said about the existing
> > > code that unconditionally tries to write-protect 2M+ pages?
>
> They're different because wr-protect can be restored (to be not-wr-protected)
> when vcpu threads write to the pages, so they need to be always done.

That's true for 4K pages, but not for write-protecting 2M+ pages
(which is what we're discussing here). Once KVM write-protects a 2M+
page, it should never need to write-protect it again, but we always
try to here. Same goes with splitting.

>
> For huge page split - when it happened during dirty tracking it'll not be
> recovered anymore, so it's a one-time thing.
>
> > > I aimed to keep parity with the write-protection calls (always try to split
> > > before write-protecting) but I agree there might be opportunities available
> > > to skip altogether.
>
> So IMHO it's not about parity but it could be about how easy can it be
> implemented, and whether it'll be worth it to add that complexity.

Agreed.

>
> Besides the above accounting idea per-sp, we can have other ways to do this
> too, e.g., keeping a bitmap showing which range has been split: that bitmap
> will be 2M in granule for x86 because that'll be enough.  We init-all-ones for
> the bitmap when start logging for a memslot.
>
> But again maybe it turns out we don't really want that complexity.
>
> IMHO a good start could be the perf numbers (which I asked in the cover letter)
> comparing the overhead of 2nd+ iterations of CLEAR_LOG with/without eager page
> split.

Ack. I'll be sure to include these in v1!

>
> > >
> > > By the way, looking at this code again I think I see some potential bugs:
> > >  - I don't think I ever free split_caches in the initially-all-set case.
>
> I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

Ah yes you are right. I misremembered how I implemented it and thought
we kept the split_caches around across calls to CLEAR_LOG. (We
probably should TBH. The current implementation is quite wasteful.)

>
> > >  - What happens if splitting fails the CLEAR_LOG but succeeds the
> > > CLEAR_LOG?
> >
> > Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
> >
> > > We would end up propagating the write-protection on the 2M
> > > page down to the 4K page. This might cause issues if using PML.
>
> Hmm looks correct.. I'm wondering what will happen with that.
>
> Firstly this should be rare as the 1st split should in 99% cases succeed.
>
> Then if split failed at the 1st attempt, we wr-protected sptes even during pml
> during the split.  When written, we'll go the fast page fault and record the
> writes too, I think, as we'll apply dirty bit to the new spte so I think it'll
> just skip pml.  Looks like we'll be using a mixture of pml+wp but all dirty
> will still be captured as expected?..

That's what I was hoping for. I'll double check for v1.

>
> There could be leftover wp when stopping dirty logging, but that seems not
> directly harmful too.  It'll make things a bit messed up, at least.
>
> --
> Peter Xu
>
David Matlack Dec. 1, 2021, 10:17 p.m. UTC | #10
On Wed, Dec 1, 2021 at 11:22 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 19, 2021, David Matlack wrote:
> > When using initially-all-set, large pages are not write-protected when
> > dirty logging is enabled on the memslot. Instead they are
> > write-protected once userspace invokes CLEAR_DIRTY_LOG for the first
> > time, and only for the specific sub-region of the memslot that userspace
> > wishes to clear.
> >
> > Enhance CLEAR_DIRTY_LOG to also try to split large pages prior to
> > write-protecting to avoid causing write-protection faults on vCPU
> > threads. This also allows userspace to smear the cost of large page
> > splitting across multiple ioctls rather than splitting the entire
> > memslot when not using initially-all-set.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  4 ++++
> >  arch/x86/kvm/mmu/mmu.c          | 30 ++++++++++++++++++++++--------
> >  2 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 432a4df817ec..6b5bf99f57af 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1591,6 +1591,10 @@ void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >                                     const struct kvm_memory_slot *memslot,
> >                                     int start_level);
> > +void kvm_mmu_try_split_large_pages(struct kvm *kvm,
>
> I would prefer we use hugepage when possible, mostly because that's the terminology
> used by the kernel.  KVM is comically inconsistent, but if we make an effort to use
> hugepage when adding new code, hopefully someday we'll have enough inertia to commit
> fully to hugepage.

Will do.

>
> > +                                const struct kvm_memory_slot *memslot,
> > +                                u64 start, u64 end,
> > +                                int target_level);
> >  void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
> >                                       const struct kvm_memory_slot *memslot,
> >                                       int target_level);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6768ef9c0891..4e78ef2dd352 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >               gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> >               gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> >
> > +             /*
> > +              * Try to proactively split any large pages down to 4KB so that
> > +              * vCPUs don't have to take write-protection faults.
> > +              */
> > +             kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
>
> This should return a value.  If splitting succeeds, there should be no hugepages
> and so walking the page tables to write-protect 2M is unnecessary.  Same for the
> previous patch, although skipping the write-protect path is a little less
> straightforward in that case.

Great idea! Will do.

>
> > +
> >               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >
> >               /* Cross two large pages? */
Peter Xu Dec. 3, 2021, 4:57 a.m. UTC | #11
On Wed, Dec 01, 2021 at 02:14:27PM -0800, David Matlack wrote:
> > > > Thanks for calling this out. Could the same be said about the existing
> > > > code that unconditionally tries to write-protect 2M+ pages?
> >
> > They're different because wr-protect can be restored (to be not-wr-protected)
> > when vcpu threads write to the pages, so they need to be always done.
> 
> That's true for 4K pages, but not for write-protecting 2M+ pages
> (which is what we're discussing here). Once KVM write-protects a 2M+
> page, it should never need to write-protect it again, but we always
> try to here. Same goes with splitting.

Ah I see, that's fair point. :)

Yeah let's see how it goes with the numbers, I'd hope it's trivial to do both
wr-protect 2m and the split unconditionally, because for CLEAR_LOG the major
overhead should be walking the small pages instead, afaiu.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 432a4df817ec..6b5bf99f57af 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1591,6 +1591,10 @@  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 				      const struct kvm_memory_slot *memslot,
 				      int start_level);
+void kvm_mmu_try_split_large_pages(struct kvm *kvm,
+				   const struct kvm_memory_slot *memslot,
+				   u64 start, u64 end,
+				   int target_level);
 void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
 					const struct kvm_memory_slot *memslot,
 					int target_level);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6768ef9c0891..4e78ef2dd352 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1448,6 +1448,12 @@  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
 		gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
 
+		/*
+		 * Try to proactively split any large pages down to 4KB so that
+		 * vCPUs don't have to take write-protection faults.
+		 */
+		kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
+
 		kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
 
 		/* Cross two large pages? */
@@ -5880,21 +5886,17 @@  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
 }
 
-void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
-					const struct kvm_memory_slot *memslot,
-					int target_level)
+void kvm_mmu_try_split_large_pages(struct kvm *kvm,
+				   const struct kvm_memory_slot *memslot,
+				   u64 start, u64 end,
+				   int target_level)
 {
-	u64 start, end;
-
 	if (!is_tdp_mmu_enabled(kvm))
 		return;
 
 	if (mmu_topup_split_caches(kvm))
 		return;
 
-	start = memslot->base_gfn;
-	end = start + memslot->npages;
-
 	read_lock(&kvm->mmu_lock);
 	kvm_tdp_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
 	read_unlock(&kvm->mmu_lock);
@@ -5902,6 +5904,18 @@  void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
 	mmu_free_split_caches(kvm);
 }
 
+void kvm_mmu_slot_try_split_large_pages(struct kvm *kvm,
+					const struct kvm_memory_slot *memslot,
+					int target_level)
+{
+	u64 start, end;
+
+	start = memslot->base_gfn;
+	end = start + memslot->npages;
+
+	kvm_mmu_try_split_large_pages(kvm, memslot, start, end, target_level);
+}
+
 static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 					 struct kvm_rmap_head *rmap_head,
 					 const struct kvm_memory_slot *slot)