
[15/15] KVM: x86/mmu: Promote pages in-place when disabling dirty logging

Message ID 20211115234603.2908381-16-bgardon@google.com (mailing list archive)
State New, archived
Series Currently disabling dirty logging with the TDP MMU is extremely slow. On a 96 vCPU / 96G VM it takes ~45 seconds to disable dirty logging with the TDP MMU, as opposed to ~3.5 seconds with the legacy MMU. This series optimizes TLB flushes and intro…

Commit Message

Ben Gardon Nov. 15, 2021, 11:46 p.m. UTC
When disabling dirty logging, the TDP MMU currently zaps each leaf entry
mapping memory in the relevant memslot. This is very slow. Doing the zaps
under the mmu read lock requires a TLB flush for every zap and the
zapping causes a storm of EPT/NPT violations.

Instead of zapping, replace the split large pages with large page
mappings directly. While this sort of operation has historically only
been done in the vCPU page fault handler context, refactorings earlier
in this series and the relative simplicity of the TDP MMU make it
possible here as well.

Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
of memory per vCPU, this reduces the time required to disable dirty
logging from over 45 seconds to just over 1 second. It also avoids
provoking page faults, improving vCPU performance while disabling
dirty logging.


Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/mmu/mmu_internal.h |  4 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 69 ++++++++++++++++++++++++++++++++-
 3 files changed, 72 insertions(+), 3 deletions(-)

Comments

Peter Xu Nov. 25, 2021, 4:18 a.m. UTC | #1
Hi, Ben,

On Mon, Nov 15, 2021 at 03:46:03PM -0800, Ben Gardon wrote:
> When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> mapping memory in the relevant memslot. This is very slow. Doing the zaps
> under the mmu read lock requires a TLB flush for every zap and the
> zapping causes a storm of ETP/NPT violations.
> 
> Instead of zapping, replace the split large pages with large page
> mappings directly. While this sort of operation has historically only
> been done in the vCPU page fault handler context, refactorings earlier
> in this series and the relative simplicity of the TDP MMU make it
> possible here as well.

Thanks for this patch, it looks very useful.

I've got a few comments below, but before that I've also got one off-topic
question; it'd be great if you could help answer it.

When I was looking into how the old code recovers the huge pages I found that
we'll leave the full-zero pgtable page there until the next page fault, then I
_think_ it'll only be released in __handle_changed_spte() when we're
dropping the old spte (handle_removed_tdp_mmu_page).

As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC
current thread should have exclusive ownership of this orphaned and abandoned
pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the
atomic operations and REMOVED_SPTE tricks to protect from concurrent access?
Since that's cmpxchg-ed out of the old pgtable, what can be accessing it
besides the current thread?

> 
> Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> of memory per vCPU, this reduces the time required to disable dirty
> logging from over 45 seconds to just over 1 second. It also avoids
> provoking page faults, improving vCPU performance while disabling
> dirty logging.
> 
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/mmu/mmu_internal.h |  4 ++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 69 ++++++++++++++++++++++++++++++++-
>  3 files changed, 72 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ef7a84422463..add724aa9e8c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void)
>   * the direct page table on host, use as much mmu features as
>   * possible, however, kvm currently does not do execution-protection.
>   */
> -static void
> +void
>  build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
>  				int shadow_root_level)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 6563cce9c438..84d439432acf 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> +				int shadow_root_level);
> +
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 43c7834b4f0a..b15c8cd11cf9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>  		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
>  }
>  
> +static void try_promote_lpage(struct kvm *kvm,
> +			      const struct kvm_memory_slot *slot,
> +			      struct tdp_iter *iter)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> +	struct rsvd_bits_validate shadow_zero_check;
> +	/*
> +	 * Since the TDP  MMU doesn't manage nested PTs, there's no need to
> +	 * write protect for a nested VM when PML is in use.
> +	 */
> +	bool ad_need_write_protect = false;

Shall we just pass in "false" in make_spte() and just move the comment there?
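
I.e. something like the below, reusing the exact argument list from this
patch (untested, just to illustrate what I mean):

	/*
	 * Since the TDP MMU doesn't manage nested PTs, there's no need to
	 * write protect for a nested VM when PML is in use.
	 */
	make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
		  map_writable, /*ad_need_write_protect=*/false, mt_mask,
		  &shadow_zero_check, &new_spte);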

> +	bool map_writable;
> +	kvm_pfn_t pfn;
> +	u64 new_spte;
> +	u64 mt_mask;
> +
> +	/*
> +	 * If addresses are being invalidated, don't do in-place promotion to
> +	 * avoid accidentally mapping an invalidated address.
> +	 */
> +	if (unlikely(kvm->mmu_notifier_count))
> +		return;
> +
> +	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> +				   &map_writable, NULL);

Should we better check pfn validity and bail out otherwise?  E.g. for atomic I
think we can also get KVM_PFN_ERR_FAULT when fast-gup failed somehow.

> +
> +	/*
> +	 * Can't reconstitute an lpage if the consituent pages can't be
> +	 * mapped higher.
> +	 */
> +	if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> +						    pfn, PG_LEVEL_NUM))
> +		return;
> +
> +	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> +	/*
> +	 * In some cases, a vCPU pointer is required to get the MT mask,
> +	 * however in most cases it can be generated without one. If a
> +	 * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> +	 * In that case, bail on in-place promotion.
> +	 */
> +	if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> +							   kvm_is_mmio_pfn(pfn),
> +							   &mt_mask)))
> +		return;
> +
> +	make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> +		  map_writable, ad_need_write_protect, mt_mask,
> +		  &shadow_zero_check, &new_spte);
> +
> +	tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> +
> +	/*
> +	 * Re-read the SPTE to avoid recursing into one of the removed child
> +	 * page tables.
> +	 */
> +	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

Is this redundant, since it seems the iterator logic handles this already?  I'm
reading try_step_down() here:

	/*
	 * Reread the SPTE before stepping down to avoid traversing into page
	 * tables that are no longer linked from this entry.
	 */
	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

The rest looks good to me, thanks.

> +}
> +
>  /*
>   * Clear leaf entries which could be replaced by large mappings, for
>   * GFNs within the slot.
> @@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>  			continue;
>  
> -		if (!is_shadow_present_pte(iter.old_spte) ||
> -		    !is_last_spte(iter.old_spte, iter.level))
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		/* Try to promote the constitutent pages to an lpage. */
> +		if (!is_last_spte(iter.old_spte, iter.level)) {
> +			try_promote_lpage(kvm, slot, &iter);
>  			continue;
> +		}
>  
>  		pfn = spte_to_pfn(iter.old_spte);
>  		if (kvm_is_reserved_pfn(pfn) ||
> -- 
> 2.34.0.rc1.387.gb447b232ab-goog
>
Ben Gardon Nov. 29, 2021, 6:31 p.m. UTC | #2
On Wed, Nov 24, 2021 at 8:19 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Ben,
>
> On Mon, Nov 15, 2021 at 03:46:03PM -0800, Ben Gardon wrote:
> > When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> > mapping memory in the relevant memslot. This is very slow. Doing the zaps
> > under the mmu read lock requires a TLB flush for every zap and the
> > zapping causes a storm of ETP/NPT violations.
> >
> > Instead of zapping, replace the split large pages with large page
> > mappings directly. While this sort of operation has historically only
> > been done in the vCPU page fault handler context, refactorings earlier
> > in this series and the relative simplicity of the TDP MMU make it
> > possible here as well.
>
> Thanks for this patch, it looks very useful.

Thanks for taking a look!

>
> I've got a few comments below, but before that I've also got one off-topic
> question too; it'll be great if you can help answer.
>
> When I was looking into how the old code recovers the huge pages I found that
> we'll leave the full-zero pgtable page there until the next page fault, then I
> _think_ it'll be released only until the __handle_changed_spte() when we're
> dropping the old spte (handle_removed_tdp_mmu_page).

That seems likely, though Sean's recent series that heavily refactored
zapping probably changed that.

>
> As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC
> current thread should have exclusive ownership of this orphaned and abandoned
> pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the
> atomic operations and REMOVED_SPTE tricks to protect from concurrent access?
> Since that's cmpxchg-ed out of the old pgtable, what can be accessing it
> besides the current thread?

The cmpxchg does nothing to guarantee that other threads can't have a
pointer to the page table, only that this thread knows it's the one
that removed it from the page table. Other threads could still have
pointers to it in two ways:
1. A kernel thread could be in the process of modifying an SPTE in the
page table, under the MMU lock in read mode. In that case, there's no
guarantee that there's not another kernel thread with a pointer to the
SPTE until the end of an RCU grace period.
2. There could be a pointer to the page table in a vCPU's paging
structure caches, which are similar to the TLB but cache partial
translations. These are also cleared out on TLB flush.
Sean's recent series linked the RCU grace period and TLB flush in a
clever way so that we can ensure that the end of a grace period
implies that the necessary flushes have happened already, but we still
need to clear out the disconnected page table with atomic operations.
We need to clear it out mostly to collect dirty / accessed bits and
update page size stats.
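
Very roughly, and glossing over the real helpers and the shared vs.
exclusive MMU lock paths, the ordering those atomics protect looks like
this (pseudo-code from memory; handle_removed_child() is just a stand-in
for the dirty/accessed harvesting done via __handle_changed_spte()):

	/*
	 * Atomically replace the parent SPTE (with the new huge SPTE here,
	 * or with REMOVED_SPTE when zapping). Losing the cmpxchg means
	 * another thread changed the SPTE first, so bail and retry.
	 */
	if (cmpxchg64(parent_sptep, old_spte, new_spte) != old_spte)
		return;

	/*
	 * The detached child table can still be reached by walkers that
	 * read the old parent SPTE before the cmpxchg (and via vCPU
	 * paging-structure caches until the flush), so zap each child entry
	 * with an atomic exchange so dirty/accessed bits and stats are
	 * harvested exactly once, no matter who else writes the entry.
	 */
	for (i = 0; i < 512; i++)
		handle_removed_child(kvm, xchg(&child_pt[i], REMOVED_SPTE));

	/*
	 * Free the page itself only after an RCU grace period, which (with
	 * the grace period / TLB flush linkage) also guarantees no vCPU
	 * still has a cached translation through it.
	 */
	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);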

>
> >
> > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> > of memory per vCPU, this reduces the time required to disable dirty
> > logging from over 45 seconds to just over 1 second. It also avoids
> > provoking page faults, improving vCPU performance while disabling
> > dirty logging.
> >
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c          |  2 +-
> >  arch/x86/kvm/mmu/mmu_internal.h |  4 ++
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 69 ++++++++++++++++++++++++++++++++-
> >  3 files changed, 72 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ef7a84422463..add724aa9e8c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void)
> >   * the direct page table on host, use as much mmu features as
> >   * possible, however, kvm currently does not do execution-protection.
> >   */
> > -static void
> > +void
> >  build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> >                               int shadow_root_level)
> >  {
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 6563cce9c438..84d439432acf 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void
> > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > +                             int shadow_root_level);
> > +
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 43c7834b4f0a..b15c8cd11cf9 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >               clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> >  }
> >
> > +static void try_promote_lpage(struct kvm *kvm,
> > +                           const struct kvm_memory_slot *slot,
> > +                           struct tdp_iter *iter)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > +     struct rsvd_bits_validate shadow_zero_check;
> > +     /*
> > +      * Since the TDP  MMU doesn't manage nested PTs, there's no need to
> > +      * write protect for a nested VM when PML is in use.
> > +      */
> > +     bool ad_need_write_protect = false;
>
> Shall we just pass in "false" in make_spte() and just move the comment there?

We could, but given the egregious number of arguments to the function
(totally my fault), I think this is probably a bit more readable.

>
> > +     bool map_writable;
> > +     kvm_pfn_t pfn;
> > +     u64 new_spte;
> > +     u64 mt_mask;
> > +
> > +     /*
> > +      * If addresses are being invalidated, don't do in-place promotion to
> > +      * avoid accidentally mapping an invalidated address.
> > +      */
> > +     if (unlikely(kvm->mmu_notifier_count))
> > +             return;
> > +
> > +     pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> > +                                &map_writable, NULL);
>
> Should we better check pfn validity and bail out otherwise?  E.g. for atomic I
> think we can also get KVM_PFN_ERR_FAULT when fast-gup failed somehow.

That's an excellent point. We should absolutely do that.
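
Probably just bailing if the lookup returns an error pfn, right after the
call, e.g. (untested):

	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
				   &map_writable, NULL);
	if (is_error_noslot_pfn(pfn))
		return;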

>
> > +
> > +     /*
> > +      * Can't reconstitute an lpage if the consituent pages can't be
> > +      * mapped higher.
> > +      */
> > +     if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> > +                                                 pfn, PG_LEVEL_NUM))
> > +             return;
> > +
> > +     build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> > +
> > +     /*
> > +      * In some cases, a vCPU pointer is required to get the MT mask,
> > +      * however in most cases it can be generated without one. If a
> > +      * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> > +      * In that case, bail on in-place promotion.
> > +      */
> > +     if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> > +                                                        kvm_is_mmio_pfn(pfn),
> > +                                                        &mt_mask)))
> > +             return;
> > +
> > +     make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> > +               map_writable, ad_need_write_protect, mt_mask,
> > +               &shadow_zero_check, &new_spte);
> > +
> > +     tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > +
> > +     /*
> > +      * Re-read the SPTE to avoid recursing into one of the removed child
> > +      * page tables.
> > +      */
> > +     iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
>
> Is this redundant since it seems the iterator logic handles this already, I'm
> reading try_step_down() here:
>
>         /*
>          * Reread the SPTE before stepping down to avoid traversing into page
>          * tables that are no longer linked from this entry.
>          */
>         iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

Oh, I had forgotten about that, but it'd be great if it was redundant.
I'll double check.

>
> The rest looks good to me, thanks.

Thanks for your review!

>
> > +}
> > +
> >  /*
> >   * Clear leaf entries which could be replaced by large mappings, for
> >   * GFNs within the slot.
> > @@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >               if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> >                       continue;
> >
> > -             if (!is_shadow_present_pte(iter.old_spte) ||
> > -                 !is_last_spte(iter.old_spte, iter.level))
> > +             if (!is_shadow_present_pte(iter.old_spte))
> > +                     continue;
> > +
> > +             /* Try to promote the constitutent pages to an lpage. */
> > +             if (!is_last_spte(iter.old_spte, iter.level)) {
> > +                     try_promote_lpage(kvm, slot, &iter);
> >                       continue;
> > +             }
> >
> >               pfn = spte_to_pfn(iter.old_spte);
> >               if (kvm_is_reserved_pfn(pfn) ||
> > --
> > 2.34.0.rc1.387.gb447b232ab-goog
> >
>
> --
> Peter Xu
>
Sean Christopherson Nov. 30, 2021, 12:13 a.m. UTC | #3
On Mon, Nov 29, 2021, Ben Gardon wrote:
> On Wed, Nov 24, 2021 at 8:19 PM Peter Xu <peterx@redhat.com> wrote:
> > I've got a few comments below, but before that I've also got one off-topic
> > question too; it'll be great if you can help answer.
> >
> > When I was looking into how the old code recovers the huge pages I found that
> > we'll leave the full-zero pgtable page there until the next page fault, then I
> > _think_ it'll be released only until the __handle_changed_spte() when we're
> > dropping the old spte (handle_removed_tdp_mmu_page).
> 
> That seems likely, though Sean's recent series that heavily refactored
> zapping probably changed that.

Hmm, no, that behavior shouldn't change for this path in my series.  Only the leaf
SPTEs are zapped, the shadow page hangs around until it's replaced by a hugepage,
reclaimed due to memory pressure, etc...  FWIW, that's consistent with the
legacy/full MMU.

Zapping the SP is doable in theory, but it would require completely different
iteration behavior and a small amount of additional logic to check the entire gfn
range covered by the SP for compatibility.  If we were starting from scratch, I
would probably advocate zapping the parent SP directly, but since the net work
done is roughly equivalent, the cost of keeping the page around is negligible,
and we have functional code already...
Peter Xu Nov. 30, 2021, 7:28 a.m. UTC | #4
On Mon, Nov 29, 2021 at 10:31:14AM -0800, Ben Gardon wrote:
> > As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC
> > current thread should have exclusive ownership of this orphaned and abandoned
> > pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the
> > atomic operations and REMOVED_SPTE tricks to protect from concurrent access?
> > Since that's cmpxchg-ed out of the old pgtable, what can be accessing it
> > besides the current thread?
> 
> The cmpxchg does nothing to guarantee that other threads can't have a
> pointer to the page table, only that this thread knows it's the one
> that removed it from the page table. Other threads could still have
> pointers to it in two ways:
> 1. A kernel thread could be in the process of modifying an SPTE in the
> page table, under the MMU lock in read mode. In that case, there's no
> guarantee that there's not another kernel thread with a pointer to the
> SPTE until the end of an RCU grace period.

Right, I definitely missed that whole picture of the RCU usage.  Thanks.

> 2. There could be a pointer to the page table in a vCPU's paging
> structure caches, which are similar to the TLB but cache partial
> translations. These are also cleared out on TLB flush.

Could you elaborate on the paging structure cache that you mentioned?  I thought
the processor page walker will just use the data cache (L1-L3) as pgtable
caches, in which case IIUC the invalidation happens when we do WRITE_ONCE(),
which invalidates the stale data cache lines on every core besides the
writer core.  But I could be completely missing something...

> Sean's recent series linked the RCU grace period and TLB flush in a
> clever way so that we can ensure that the end of a grace period
> implies that the necessary flushes have happened already, but we still
> need to clear out the disconnected page table with atomic operations.
> We need to clear it out mostly to collect dirty / accessed bits and
> update page size stats.

Yes, this sounds reasonable too.
Sean Christopherson Nov. 30, 2021, 4:01 p.m. UTC | #5
On Tue, Nov 30, 2021, Peter Xu wrote:
> On Mon, Nov 29, 2021 at 10:31:14AM -0800, Ben Gardon wrote:
> > 2. There could be a pointer to the page table in a vCPU's paging
> > structure caches, which are similar to the TLB but cache partial
> > translations. These are also cleared out on TLB flush.
> 
> Could you elaborate what's the structure cache that you mentioned?  I thought
> the processor page walker will just use the data cache (L1-L3) as pgtable
> caches, in which case IIUC the invalidation happens when we do WRITE_ONCE()
> that'll invalidate all the rest data cache besides the writter core.  But I
> could be completely missing something..

Ben is referring to the Intel SDM's use of the term "paging-structure caches".
Intel CPUs, and I'm guessing other x86 CPUs, cache upper level entries, e.g. the
L4 PTE for a given address, to avoid having to do data cache lookups, reserved
bit checks, A/D assists, etc...   Like full VA=>PA TLB entries, these entries
are associated with the PCID, VPID, EPT4A, etc...

The data caches are still used when reading PTEs that aren't cached in the TLB,
the extra caching in the "TLB" is optimization on top.

  28.3.1 Information That May Be Cached
  Section 4.10, “Caching Translation Information” in Intel® 64 and IA-32 Architectures
  Software Developer’s Manual, Volume 3A identifies two kinds of translation-related
  information that may be cached by a logical processor: translations, which are mappings
  from linear page numbers to physical page frames, and paging-structure caches, which
  map the upper bits of a linear page number to information from the paging-structure
  entries used to translate linear addresses matching those upper bits.
Peter Xu Dec. 1, 2021, 1:59 a.m. UTC | #6
On Tue, Nov 30, 2021 at 04:01:50PM +0000, Sean Christopherson wrote:
> On Tue, Nov 30, 2021, Peter Xu wrote:
> > On Mon, Nov 29, 2021 at 10:31:14AM -0800, Ben Gardon wrote:
> > > 2. There could be a pointer to the page table in a vCPU's paging
> > > structure caches, which are similar to the TLB but cache partial
> > > translations. These are also cleared out on TLB flush.
> > 
> > Could you elaborate what's the structure cache that you mentioned?  I thought
> > the processor page walker will just use the data cache (L1-L3) as pgtable
> > caches, in which case IIUC the invalidation happens when we do WRITE_ONCE()
> > that'll invalidate all the rest data cache besides the writter core.  But I
> > could be completely missing something..
> 
> Ben is referring to the Intel SDM's use of the term "paging-structure caches"
> Intel CPUs, and I'm guessing other x86 CPUs, cache upper level entries, e.g. the
> L4 PTE for a given address, to avoid having to do data cache lookups, reserved
> bits checked, A/D assists, etc...   Like full VA=>PA TLB entries, these entries
> are associated with the PCID, VPID, EPT4A, etc...
> 
> The data caches are still used when reading PTEs that aren't cached in the TLB,
> the extra caching in the "TLB" is optimization on top.
> 
>   28.3.1 Information That May Be Cached
>   Section 4.10, “Caching Translation Information” in Intel® 64 and IA-32 Architectures
>   Software Developer’s Manual, Volume 3A identifies two kinds of translation-related
>   information that may be cached by a logical processor: translations, which are mappings
>   from linear page numbers to physical page frames, and paging-structure caches, which
>   map the upper bits of a linear page number to information from the paging-structure
>   entries used to translate linear addresses matching those upper bits.

Ah, I should have tried harder when reading the spec; I just stopped at
4.10.2... :) They're also described in the general section 4.10.3, along with
how TLB invalidations affect these caches in 4.10.4.

Thanks again to both!

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ef7a84422463..add724aa9e8c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4449,7 +4449,7 @@  static inline bool boot_cpu_is_amd(void)
  * the direct page table on host, use as much mmu features as
  * possible, however, kvm currently does not do execution-protection.
  */
-static void
+void
 build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
 				int shadow_root_level)
 {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 6563cce9c438..84d439432acf 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -161,4 +161,8 @@  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+void
+build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
+				int shadow_root_level);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 43c7834b4f0a..b15c8cd11cf9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1361,6 +1361,66 @@  void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
 }
 
+static void try_promote_lpage(struct kvm *kvm,
+			      const struct kvm_memory_slot *slot,
+			      struct tdp_iter *iter)
+{
+	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
+	struct rsvd_bits_validate shadow_zero_check;
+	/*
+	 * Since the TDP MMU doesn't manage nested PTs, there's no need to
+	 * write protect for a nested VM when PML is in use.
+	 */
+	bool ad_need_write_protect = false;
+	bool map_writable;
+	kvm_pfn_t pfn;
+	u64 new_spte;
+	u64 mt_mask;
+
+	/*
+	 * If addresses are being invalidated, don't do in-place promotion to
+	 * avoid accidentally mapping an invalidated address.
+	 */
+	if (unlikely(kvm->mmu_notifier_count))
+		return;
+
+	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
+				   &map_writable, NULL);
+
+	/*
+	 * Can't reconstitute an lpage if the constituent pages can't be
+	 * mapped higher.
+	 */
+	if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
+						    pfn, PG_LEVEL_NUM))
+		return;
+
+	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
+
+	/*
+	 * In some cases, a vCPU pointer is required to get the MT mask,
+	 * however in most cases it can be generated without one. If a
+	 * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
+	 * In that case, bail on in-place promotion.
+	 */
+	if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
+							   kvm_is_mmio_pfn(pfn),
+							   &mt_mask)))
+		return;
+
+	make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
+		  map_writable, ad_need_write_protect, mt_mask,
+		  &shadow_zero_check, &new_spte);
+
+	tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+
+	/*
+	 * Re-read the SPTE to avoid recursing into one of the removed child
+	 * page tables.
+	 */
+	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+}
+
 /*
  * Clear leaf entries which could be replaced by large mappings, for
  * GFNs within the slot.
@@ -1381,9 +1441,14 @@  static void zap_collapsible_spte_range(struct kvm *kvm,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
-		if (!is_shadow_present_pte(iter.old_spte) ||
-		    !is_last_spte(iter.old_spte, iter.level))
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		/* Try to promote the constituent pages to an lpage. */
+		if (!is_last_spte(iter.old_spte, iter.level)) {
+			try_promote_lpage(kvm, slot, &iter);
 			continue;
+		}
 
 		pfn = spte_to_pfn(iter.old_spte);
 		if (kvm_is_reserved_pfn(pfn) ||