
[v2,07/20] kvm: x86/mmu: Support zapping SPTEs in the TDP MMU

Message ID 20201014182700.2888246-8-bgardon@google.com (mailing list archive)
State New, archived
Series Introduce the TDP MMU

Commit Message

Ben Gardon Oct. 14, 2020, 6:26 p.m. UTC
Add functions to zap SPTEs to the TDP MMU. These are needed to tear down
TDP MMU roots properly and implement other MMU functions which require
tearing down mappings. Future patches will add functions to populate the
page tables, but as of this patch there will not be any work for these
functions to do.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c      |  15 +++++
 arch/x86/kvm/mmu/tdp_iter.c |   5 ++
 arch/x86/kvm/mmu/tdp_iter.h |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c  | 109 ++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h  |   2 +
 5 files changed, 132 insertions(+)

Comments

Edgecombe, Rick P Oct. 19, 2020, 8:49 p.m. UTC | #1
On Wed, 2020-10-14 at 11:26 -0700, Ben Gardon wrote:
> @@ -5827,6 +5831,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> gfn_start, gfn_t gfn_end)
>         struct kvm_memslots *slots;
>         struct kvm_memory_slot *memslot;
>         int i;
> +       bool flush;
>  
>         spin_lock(&kvm->mmu_lock);
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> @@ -5846,6 +5851,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> gfn_start, gfn_t gfn_end)
>                 }
>         }
>  
> +       if (kvm->arch.tdp_mmu_enabled) {
> +               flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start,
> gfn_end);
> +               if (flush)
> +                       kvm_flush_remote_tlbs(kvm);
> +       }
> +
>         spin_unlock(&kvm->mmu_lock);
>  }

Hi,

I'm just going through this looking at how I might integrate some other
MMU changes I had been working on. But as long as I am, I'll toss out
an extremely small comment that the "flush" bool seems unnecessary.

I'm also wondering a bit about this function in general. It seems that
this change adds an extra flush in the nested case, but this operation
already flushed for each memslot in order to facilitate the spin break.
If slot_handle_level_range() took some extra parameters it could maybe
be avoided. Not sure if it's worth it.
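
For what it's worth, one shape that could take, purely as an untested
sketch (the "flush" in/out parameter replacing lock_flush_tlb here is
hypothetical), is to have the walk accumulate the pending flush and hand
it back to the caller instead of always doing the final flush itself:

static __always_inline bool
slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
			slot_level_handler fn, int start_level, int end_level,
			gfn_t start_gfn, gfn_t end_gfn, bool flush)
{
	struct slot_rmap_walk_iterator iterator;

	for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
				 end_gfn, &iterator) {
		if (iterator.rmap)
			flush |= fn(kvm, iterator.rmap);

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			/* Still flush before yielding, if anything was zapped. */
			if (flush) {
				kvm_flush_remote_tlbs(kvm);
				flush = false;
			}
			cond_resched_lock(&kvm->mmu_lock);
		}
	}

	/* Leave any still-pending flush to the caller. */
	return flush;
}

kvm_zap_gfn_range() could then OR the returned value together with the
result of kvm_tdp_mmu_zap_gfn_range() and issue a single
kvm_flush_remote_tlbs() at the end.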

Rick
Ben Gardon Oct. 19, 2020, 9:33 p.m. UTC | #2
On Mon, Oct 19, 2020 at 1:50 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2020-10-14 at 11:26 -0700, Ben Gardon wrote:
> > @@ -5827,6 +5831,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> > gfn_start, gfn_t gfn_end)
> >         struct kvm_memslots *slots;
> >         struct kvm_memory_slot *memslot;
> >         int i;
> > +       bool flush;
> >
> >         spin_lock(&kvm->mmu_lock);
> >         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > @@ -5846,6 +5851,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> > gfn_start, gfn_t gfn_end)
> >                 }
> >         }
> >
> > +       if (kvm->arch.tdp_mmu_enabled) {
> > +               flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start,
> > gfn_end);
> > +               if (flush)
> > +                       kvm_flush_remote_tlbs(kvm);
> > +       }
> > +
> >         spin_unlock(&kvm->mmu_lock);
> >  }
>
> Hi,
>
> I'm just going through this looking at how I might integrate some other
> MMU changes I had been working on. But as long as I am, I'll toss out
> an extremely small comment that the "flush" bool seems unnecessary.

I agree this could easily be replaced with:
if (kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end))
        kvm_flush_remote_tlbs(kvm);

I like the flush variable just because I think it gives a little more
explanation to the code, but I agree both are perfectly good.

>
> I'm also wondering a bit about this function in general. It seems that
> this change adds an extra flush in the nested case, but this operation
> already flushed for each memslot in order to facilitate the spin break.
> If slot_handle_level_range() took some extra parameters it could maybe
> be avoided. Not sure if it's worth it.

I agree, there's a lot of room for optimization here to reduce the
number of TLB flushes. In this series I have not been too concerned
about optimizing performance. I wanted it to be easy to review and to
minimize the number of bugs in the code.

Future patch series will optimize the TDP MMU and make it actually
performant. Two specific changes I have planned to reduce the number
of TLB flushes are 1.) a deferred TLB flush scheme using the existing
vm-global tlbs_dirty count and 2.) a system for skipping the "legacy
MMU" handlers for various operations if the TDP MMU is enabled and the
"legacy MMU" has not been used on that VM. I believe both of these are
present in the original RFC I sent out a year ago if you're
interested. I'll CC you on those future optimizations.
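
As a rough illustration of the second idea (the shadow_mmu_used flag and
the kvm_mmu_zap_all_legacy() helper below are hypothetical, not code from
this series or the RFC), the common entry points could gate the legacy
walk on whether the shadow MMU was ever used on the VM:

void kvm_mmu_zap_all(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);

	/*
	 * Skip the rmap/shadow-page walk entirely if the "legacy MMU" never
	 * built any shadow pages for this VM.  kvm_mmu_zap_all_legacy()
	 * stands in for the existing body of this function.
	 */
	if (READ_ONCE(kvm->arch.shadow_mmu_used))
		kvm_mmu_zap_all_legacy(kvm);

	if (kvm->arch.tdp_mmu_enabled)
		kvm_tdp_mmu_zap_all(kvm);

	spin_unlock(&kvm->mmu_lock);
}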

>
> Rick
Yu Zhang Oct. 21, 2020, 3:02 p.m. UTC | #3
On Wed, Oct 14, 2020 at 11:26:47AM -0700, Ben Gardon wrote:
> Add functions to zap SPTEs to the TDP MMU. These are needed to tear down
> TDP MMU roots properly and implement other MMU functions which require
> tearing down mappings. Future patches will add functions to populate the
> page tables, but as of this patch there will not be any work for these
> functions to do.
> 
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
> 
> This series can be viewed in Gerrit at:
> 	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c      |  15 +++++
>  arch/x86/kvm/mmu/tdp_iter.c |   5 ++
>  arch/x86/kvm/mmu/tdp_iter.h |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c  | 109 ++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h  |   2 +
>  5 files changed, 132 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8bf20723c6177..337ab6823e312 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5787,6 +5787,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	kvm_reload_remote_mmus(kvm);
>  
>  	kvm_zap_obsolete_pages(kvm);
> +
> +	if (kvm->arch.tdp_mmu_enabled)
> +		kvm_tdp_mmu_zap_all(kvm);
> +
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> @@ -5827,6 +5831,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  	struct kvm_memslots *slots;
>  	struct kvm_memory_slot *memslot;
>  	int i;
> +	bool flush;
>  
>  	spin_lock(&kvm->mmu_lock);
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> @@ -5846,6 +5851,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  		}
>  	}
>  
> +	if (kvm->arch.tdp_mmu_enabled) {
> +		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
> +		if (flush)
> +			kvm_flush_remote_tlbs(kvm);
> +	}
> +
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> @@ -6012,6 +6023,10 @@ void kvm_mmu_zap_all(struct kvm *kvm)
>  	}
>  
>  	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +
> +	if (kvm->arch.tdp_mmu_enabled)
> +		kvm_tdp_mmu_zap_all(kvm);
> +
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b07e9f0c5d4aa..701eb753b701e 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -174,3 +174,8 @@ void tdp_iter_refresh_walk(struct tdp_iter *iter)
>  		       iter->root_level, iter->min_level, goal_gfn);
>  }
>  
> +u64 *tdp_iter_root_pt(struct tdp_iter *iter)
> +{
> +	return iter->pt_path[iter->root_level - 1];
> +}
> +
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index d629a53e1b73f..884ed2c70bfed 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -52,5 +52,6 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
>  		    int min_level, gfn_t goal_gfn);
>  void tdp_iter_next(struct tdp_iter *iter);
>  void tdp_iter_refresh_walk(struct tdp_iter *iter);
> +u64 *tdp_iter_root_pt(struct tdp_iter *iter);
>  
>  #endif /* __KVM_X86_MMU_TDP_ITER_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f2bd3a6928ce9..9b5cd4a832f1a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -56,8 +56,13 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
>  	return sp->tdp_mmu_page && sp->root_count;
>  }
>  
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			  gfn_t start, gfn_t end);
> +
>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>  {
> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> +

boot_cpu_data.x86_phys_bits is the host address width; the guest's value
may differ. So maybe we should just traverse the memslots and zap the gfn
ranges in each of them?
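
Something like the following, as an illustration only (zap_root_memslots()
is a made-up helper; it reuses kvm_mmu_page_as_id() and zap_gfn_range()
from this patch and is untested):

static void zap_root_memslots(struct kvm *kvm, struct kvm_mmu_page *root)
{
	struct kvm_memslots *slots;
	struct kvm_memory_slot *memslot;

	/* Only the address space this root belongs to can map it. */
	slots = __kvm_memslots(kvm, kvm_mmu_page_as_id(root));

	kvm_for_each_memslot(memslot, slots)
		zap_gfn_range(kvm, root, memslot->base_gfn,
			      memslot->base_gfn + memslot->npages);
}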

>  	lockdep_assert_held(&kvm->mmu_lock);
>  
>  	WARN_ON(root->root_count);
> @@ -65,6 +70,8 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>  
>  	list_del(&root->link);
>  
> +	zap_gfn_range(kvm, root, 0, max_gfn);
> +
>  	free_page((unsigned long)root->spt);
>  	kmem_cache_free(mmu_page_header_cache, root);
>  }
> @@ -155,6 +162,11 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  				u64 old_spte, u64 new_spte, int level);
>  
> +static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> +{
> +	return sp->role.smm ? 1 : 0;
> +}
> +
>  /**
>   * handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
> @@ -262,3 +274,100 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  {
>  	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
>  }
> +
> +static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> +				    u64 new_spte)
> +{
> +	u64 *root_pt = tdp_iter_root_pt(iter);
> +	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
> +	int as_id = kvm_mmu_page_as_id(root);
> +
> +	*iter->sptep = new_spte;
> +
> +	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
> +			    iter->level);
> +}
> +
> +#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
> +	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
> +
> +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +		kvm_flush_remote_tlbs(kvm);
> +		cond_resched_lock(&kvm->mmu_lock);
> +		tdp_iter_refresh_walk(iter);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Tears down the mappings for the range of gfns, [start, end), and frees the
> + * non-root pages mapping GFNs strictly within that range. Returns true if
> + * SPTEs have been cleared and a TLB flush is needed before releasing the
> + * MMU lock.
> + */
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			  gfn_t start, gfn_t end)
> +{
> +	struct tdp_iter iter;
> +	bool flush_needed = false;
> +
> +	tdp_root_for_each_pte(iter, root, start, end) {
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		/*
> +		 * If this is a non-last-level SPTE that covers a larger range
> +		 * than should be zapped, continue, and zap the mappings at a
> +		 * lower level.
> +		 */
> +		if ((iter.gfn < start ||
> +		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> +		    !is_last_spte(iter.old_spte, iter.level))
> +			continue;
> +
> +		tdp_mmu_set_spte(kvm, &iter, 0);
> +
> +		flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
> +	}
> +	return flush_needed;
> +}
> +
> +/*
> + * Tears down the mappings for the range of gfns, [start, end), and frees the
> + * non-root pages mapping GFNs strictly within that range. Returns true if
> + * SPTEs have been cleared and a TLB flush is needed before releasing the
> + * MMU lock.
> + */
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_mmu_page *root;
> +	bool flush = false;
> +
> +	for_each_tdp_mmu_root(kvm, root) {
> +		/*
> +		 * Take a reference on the root so that it cannot be freed if
> +		 * this thread releases the MMU lock and yields in this loop.
> +		 */
> +		get_tdp_mmu_root(kvm, root);
> +
> +		flush |= zap_gfn_range(kvm, root, start, end);
> +
> +		put_tdp_mmu_root(kvm, root);
> +	}
> +
> +	return flush;
> +}
> +
> +void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> +{
> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> +	bool flush;
> +
> +	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index ac0ef91294420..6de2d007fc03c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -12,4 +12,6 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
>  
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.1011.ga647a8990f-goog
> 

B.R.
Yu
Paolo Bonzini Oct. 21, 2020, 5:20 p.m. UTC | #4
On 21/10/20 17:02, Yu Zhang wrote:
>>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>  {
>> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
>> +
> boot_cpu_data.x86_phys_bits is the host address width; the guest's value
> may differ. So maybe we should just traverse the memslots and zap the gfn
> ranges in each of them?
> 

It must be smaller than the host value for two-dimensional paging, though.

Paolo
Yu Zhang Oct. 21, 2020, 5:24 p.m. UTC | #5
On Wed, Oct 21, 2020 at 07:20:15PM +0200, Paolo Bonzini wrote:
> On 21/10/20 17:02, Yu Zhang wrote:
> >>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >>  {
> >> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> >> +
> > boot_cpu_data.x86_phys_bits is the host address width; the guest's value
> > may differ. So maybe we should just traverse the memslots and zap the gfn
> > ranges in each of them?
> > 
> 
> It must be smaller than the host value for two-dimensional paging, though.

Yes. And using boot_cpu_data.x86_phys_bits works, but won't it be somewhat
overkill? E.g. for a host with a 46-bit width and a guest with a 39-bit width?

Any concern for doing the zap by going through the memslots? Thanks. :)

B.R.
Yu
> 
> Paolo
>
Paolo Bonzini Oct. 21, 2020, 6 p.m. UTC | #6
On 21/10/20 19:24, Yu Zhang wrote:
> On Wed, Oct 21, 2020 at 07:20:15PM +0200, Paolo Bonzini wrote:
>> On 21/10/20 17:02, Yu Zhang wrote:
>>>>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>>>  {
>>>> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
>>>> +
>>> boot_cpu_data.x86_phys_bits is the host address width; the guest's value
>>> may differ. So maybe we should just traverse the memslots and zap the gfn
>>> ranges in each of them?
>>>
>>
>> It must be smaller than the host value for two-dimensional paging, though.
> 
> Yes. And using boot_cpu_data.x86_phys_bits works, but won't it be somewhat
> overkill? E.g. for a host with a 46-bit width and a guest with a 39-bit width?

It would go quickly through extra memory space because the PML4E entries
above the first would be empty.  So it's just 511 comparisons.

Paolo
Yu Zhang Oct. 22, 2020, 2:24 a.m. UTC | #7
On Wed, Oct 21, 2020 at 08:00:47PM +0200, Paolo Bonzini wrote:
> On 21/10/20 19:24, Yu Zhang wrote:
> > On Wed, Oct 21, 2020 at 07:20:15PM +0200, Paolo Bonzini wrote:
> >> On 21/10/20 17:02, Yu Zhang wrote:
> >>>>  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >>>>  {
> >>>> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> >>>> +
> >>> boot_cpu_data.x86_phys_bits is the host address width; the guest's value
> >>> may differ. So maybe we should just traverse the memslots and zap the gfn
> >>> ranges in each of them?
> >>>
> >>
> >> It must be smaller than the host value for two-dimensional paging, though.
> > 
> > Yes. And using boot_cpu_data.x86_phys_bits works, but won't it be somewhat
> > overkill? E.g. for a host with a 46-bit width and a guest with a 39-bit width?
> 
> It would go quickly through extra memory space because the PML4E entries
> above the first would be empty.  So it's just 511 comparisons.
> 

Oh, yes. The overhead seems not as big as I assumed. :)

Yu
> Paolo
>

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8bf20723c6177..337ab6823e312 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5787,6 +5787,10 @@  static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	kvm_reload_remote_mmus(kvm);
 
 	kvm_zap_obsolete_pages(kvm);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_zap_all(kvm);
+
 	spin_unlock(&kvm->mmu_lock);
 }
 
@@ -5827,6 +5831,7 @@  void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
 	int i;
+	bool flush;
 
 	spin_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -5846,6 +5851,12 @@  void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		}
 	}
 
+	if (kvm->arch.tdp_mmu_enabled) {
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
+		if (flush)
+			kvm_flush_remote_tlbs(kvm);
+	}
+
 	spin_unlock(&kvm->mmu_lock);
 }
 
@@ -6012,6 +6023,10 @@  void kvm_mmu_zap_all(struct kvm *kvm)
 	}
 
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_zap_all(kvm);
+
 	spin_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index b07e9f0c5d4aa..701eb753b701e 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -174,3 +174,8 @@  void tdp_iter_refresh_walk(struct tdp_iter *iter)
 		       iter->root_level, iter->min_level, goal_gfn);
 }
 
+u64 *tdp_iter_root_pt(struct tdp_iter *iter)
+{
+	return iter->pt_path[iter->root_level - 1];
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index d629a53e1b73f..884ed2c70bfed 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -52,5 +52,6 @@  void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
 		    int min_level, gfn_t goal_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_refresh_walk(struct tdp_iter *iter);
+u64 *tdp_iter_root_pt(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f2bd3a6928ce9..9b5cd4a832f1a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -56,8 +56,13 @@  bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
 	return sp->tdp_mmu_page && sp->root_count;
 }
 
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			  gfn_t start, gfn_t end);
+
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
+	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
+
 	lockdep_assert_held(&kvm->mmu_lock);
 
 	WARN_ON(root->root_count);
@@ -65,6 +70,8 @@  void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
 
 	list_del(&root->link);
 
+	zap_gfn_range(kvm, root, 0, max_gfn);
+
 	free_page((unsigned long)root->spt);
 	kmem_cache_free(mmu_page_header_cache, root);
 }
@@ -155,6 +162,11 @@  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level);
 
+static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
+{
+	return sp->role.smm ? 1 : 0;
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -262,3 +274,100 @@  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 {
 	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
 }
+
+static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+				    u64 new_spte)
+{
+	u64 *root_pt = tdp_iter_root_pt(iter);
+	struct kvm_mmu_page *root = sptep_to_sp(root_pt);
+	int as_id = kvm_mmu_page_as_id(root);
+
+	*iter->sptep = new_spte;
+
+	handle_changed_spte(kvm, as_id, iter->gfn, iter->old_spte, new_spte,
+			    iter->level);
+}
+
+#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
+	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
+
+static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+{
+	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		kvm_flush_remote_tlbs(kvm);
+		cond_resched_lock(&kvm->mmu_lock);
+		tdp_iter_refresh_walk(iter);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Tears down the mappings for the range of gfns, [start, end), and frees the
+ * non-root pages mapping GFNs strictly within that range. Returns true if
+ * SPTEs have been cleared and a TLB flush is needed before releasing the
+ * MMU lock.
+ */
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			  gfn_t start, gfn_t end)
+{
+	struct tdp_iter iter;
+	bool flush_needed = false;
+
+	tdp_root_for_each_pte(iter, root, start, end) {
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		/*
+		 * If this is a non-last-level SPTE that covers a larger range
+		 * than should be zapped, continue, and zap the mappings at a
+		 * lower level.
+		 */
+		if ((iter.gfn < start ||
+		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+		    !is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		tdp_mmu_set_spte(kvm, &iter, 0);
+
+		flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+	}
+	return flush_needed;
+}
+
+/*
+ * Tears down the mappings for the range of gfns, [start, end), and frees the
+ * non-root pages mapping GFNs strictly within that range. Returns true if
+ * SPTEs have been cleared and a TLB flush is needed before releasing the
+ * MMU lock.
+ */
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_mmu_page *root;
+	bool flush = false;
+
+	for_each_tdp_mmu_root(kvm, root) {
+		/*
+		 * Take a reference on the root so that it cannot be freed if
+		 * this thread releases the MMU lock and yields in this loop.
+		 */
+		get_tdp_mmu_root(kvm, root);
+
+		flush |= zap_gfn_range(kvm, root, start, end);
+
+		put_tdp_mmu_root(kvm, root);
+	}
+
+	return flush;
+}
+
+void kvm_tdp_mmu_zap_all(struct kvm *kvm)
+{
+	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
+	bool flush;
+
+	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index ac0ef91294420..6de2d007fc03c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -12,4 +12,6 @@  bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);
 
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */