diff mbox series

[1/3] KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot

Message ID 20241009192345.1148353-2-seanjc@google.com (mailing list archive)
State New
Series KVM: x86/mmu: Don't zap "direct" non-leaf SPTEs on memslot removal

Commit Message

Sean Christopherson Oct. 9, 2024, 7:23 p.m. UTC
When performing a targeted zap on memslot removal, zap only MMU pages that
shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
unnecessary.  Furthermore, for_each_gfn_valid_sp() arguably shouldn't
exist, because it doesn't do what most people would expect it to do.
The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
means that the exact gfn comparison will not get a match, even when a SP
does "cover" a gfn, or was even created specifically for a gfn.
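
To illustrate the mismatch, here is a minimal userspace sketch, not KVM
code.  It assumes a simplified model in which a direct SP at level N
(1 = page table) spans 512^N gfns and records the base gfn of that span;
all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Number of 4KiB gfns spanned by one SP at @level in this model: 512^level. */
static inline uint64_t model_gfns_per_sp(int level)
{
	assert(level >= 1 && level <= 5);
	return 1ULL << (9 * level);
}

/* gfn recorded in a direct SP in this model: the base of the span it covers. */
static inline gfn_t model_direct_sp_gfn(gfn_t gfn, int level)
{
	return gfn & ~(model_gfns_per_sp(level) - 1);
}

/* The SP "covers" @gfn if @gfn falls inside [sp_gfn, sp_gfn + span). */
static inline bool model_sp_covers(gfn_t sp_gfn, int level, gfn_t gfn)
{
	return gfn >= sp_gfn && gfn - sp_gfn < model_gfns_per_sp(level);
}

/* The comparison an exact-match hash walk performs: sp->gfn == gfn. */
static inline bool model_exact_match(gfn_t sp_gfn, gfn_t gfn)
{
	return sp_gfn == gfn;
}
```

In this model, a level-1 direct SP created for gfn 0x12345 records gfn
0x12200, so it covers 0x12345 but an exact comparison against 0x12345
does not match it.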

For memslot deletion specifically, KVM's behavior will vary significantly
based on the size and alignment of a memslot, and in weird ways.  E.g. for
a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
it's only 4KiB aligned.  And as described below, zapping SPs in the
aligned case overzaps for direct MMUs, as odds are good the upper-level
SPs are serving other memslots.
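
The alignment dependence can be seen in a rough userspace sketch, not KVM
code.  It assumes a simplified model where direct SPs exist at levels 1-4
for the address in question and each records a gfn aligned to the 512^level
gfns it spans; the model_* names and MODEL_MAX_SP_LEVEL are hypothetical.

```c
#include <stdint.h>

typedef uint64_t gfn_t;

#define MODEL_MAX_SP_LEVEL 4

/*
 * Count the levels at which a direct SP's recorded (rounded) gfn would be
 * exactly equal to @gfn, i.e. how many direct SPs an exact-match walk on
 * @gfn could hit in this model.
 */
static int model_count_exact_matches(gfn_t gfn)
{
	int level, matches = 0;

	for (level = 1; level <= MODEL_MAX_SP_LEVEL; level++) {
		uint64_t span = 1ULL << (9 * level);

		/* Equal to the rounded gfn only if @gfn is span-aligned. */
		if ((gfn & (span - 1)) == 0)
			matches++;
	}
	return matches;
}
```

Under these assumptions, a single-page slot at gfn 0x40000 (1GiB aligned)
matches direct SPs at two levels, while the same slot at gfn 0x40001
matches none.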

To iterate over all potentially-relevant gfns, KVM would need to make a
pass over the hash table for each level, with the gfn used for lookup
rounded for said level.  And then check that the SP is of the correct
level, too, e.g. to avoid over-zapping.
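
A sketch of what such a per-level walk might look like, written as
self-contained userspace C rather than KVM code.  The hash table, the hash
function, and every model_* name are hypothetical stand-ins, and the
rounding assumes an SP at level N spans 512^N gfns.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define MODEL_NR_BUCKETS   64
#define MODEL_MAX_SP_LEVEL 4

struct model_sp {
	gfn_t gfn;		/* base gfn recorded for the SP */
	int level;		/* 1 = page table, 2 = page directory, ... */
	struct model_sp *next;	/* hash chain */
};

struct model_hash {
	struct model_sp *bucket[MODEL_NR_BUCKETS];
};

static size_t model_hashfn(gfn_t gfn)
{
	return (size_t)(gfn % MODEL_NR_BUCKETS);
}

static void model_insert(struct model_hash *h, struct model_sp *sp)
{
	size_t b = model_hashfn(sp->gfn);

	sp->next = h->bucket[b];
	h->bucket[b] = sp;
}

/* One hash lookup per level, using the gfn rounded for that level. */
static size_t model_count_covering_sps(const struct model_hash *h, gfn_t gfn)
{
	size_t found = 0;
	int level;

	for (level = 1; level <= MODEL_MAX_SP_LEVEL; level++) {
		gfn_t rounded = gfn & ~((1ULL << (9 * level)) - 1);
		const struct model_sp *sp;

		for (sp = h->bucket[model_hashfn(rounded)]; sp; sp = sp->next) {
			/* The level check keeps other levels' SPs out. */
			if (sp->gfn == rounded && sp->level == level)
				found++;
		}
	}
	return found;
}
```

Without the level comparison, an SP recorded at gfn 0 for level 2 would
also be counted when the level-3 and level-4 lookups round to gfn 0.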

But even then, KVM would massively overzap, as processing every level is
all but guaranteed to zap SPs that serve other memslots, especially if the
memslot being removed is relatively small.  KVM could mitigate that issue
by processing only levels that can be guest huge pages, i.e. are
less likely to be re-used for other memslots, but while somewhat logical,
that's quite arbitrary and would be a bit of a mess to implement.

So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
absolutely must be zapped.
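
The patch folds the gfn comparison and the sp_has_gptes() check into one
iterator macro that uses the "if (skip) {} else" filter idiom.  A minimal
userspace sketch of that idiom follows; the demo_* names, the list layout,
and the has_gptes field are hypothetical and do not reflect KVM's actual
structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_sp {
	uint64_t gfn;
	bool has_gptes;		/* stand-in for sp_has_gptes(sp) */
	struct demo_sp *next;
};

/* Plain walk over a singly linked chain. */
#define demo_for_each_sp(_sp, _head)					\
	for ((_sp) = (_head); (_sp); (_sp) = (_sp)->next)

/*
 * Filtered walk: the empty "{}" branch skips non-matching entries, and the
 * statement the caller writes after the macro becomes the "else" body.
 */
#define demo_for_each_gfn_sp_with_gptes(_sp, _head, _gfn)		\
	demo_for_each_sp(_sp, _head)					\
		if ((_sp)->gfn != (_gfn) || !(_sp)->has_gptes) {} else

/* Count the entries that a zap loop built on the filter would visit. */
static size_t demo_count_zappable(struct demo_sp *head, uint64_t gfn)
{
	struct demo_sp *sp;
	size_t n = 0;

	demo_for_each_gfn_sp_with_gptes(sp, head, gfn)
		n++;
	return n;
}
```

Writing the filter as "if (skip) {} else" rather than "if (match)" means a
stray "else" after the caller's statement cannot bind to the macro's
hidden "if".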

Cc: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

Comments

Yan Zhao Oct. 10, 2024, 7:59 a.m. UTC | #1
Tests of "normal VM + nested VM + 3 selftests" passed on the 3 configs:
1) modprobe kvm_intel ept=0
2) modprobe kvm tdp_mmu=0
   modprobe kvm_intel ept=1
3) modprobe kvm tdp_mmu=1
   modprobe kvm_intel ept=1

Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>

On Wed, Oct 09, 2024 at 12:23:43PM -0700, Sean Christopherson wrote:
> When performing a targeted zap on memslot removal, zap only MMU pages that
> shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
> unnecessary.  Furthermore, for_each_gfn_valid_sp() arguably shouldn't
> exist, because it doesn't do what most people would expect it to do.
> The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
> means that the exact gfn comparison will not get a match, even when a SP
> does "cover" a gfn, or was even created specifically for a gfn.
> 
> For memslot deletion specifically, KVM's behavior will vary significantly
> based on the size and alignment of a memslot, and in weird ways.  E.g. for
> a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
> it's only 4KiB aligned.  And as described below, zapping SPs in the
> aligned case overzaps for direct MMUs, as odds are good the upper-level
> SPs are serving other memslots.
> 
> To iterate over all potentially-relevant gfns, KVM would need to make a
> pass over the hash table for each level, with the gfn used for lookup
> rounded for said level.  And then check that the SP is of the correct
> level, too, e.g. to avoid over-zapping.
> 
> But even then, KVM would massively overzap, as processing every level is
> all but guaranteed to zap SPs that serve other memslots, especially if the
> memslot being removed is relatively small.  KVM could mitigate that issue
> by processing only levels that can be guest huge pages, i.e. are
> less likely to be re-used for other memslots, but while somewhat logical,
> that's quite arbitrary and would be a bit of a mess to implement.
> 
> So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
> is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
> absolutely must be zapped.
> 
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a9a23e058555..09494d01c38e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1884,14 +1884,10 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
>  		if (is_obsolete_sp((_kvm), (_sp))) {			\
>  		} else
>  
> -#define for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
> +#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
>  	for_each_valid_sp(_kvm, _sp,					\
>  	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
> -		if ((_sp)->gfn != (_gfn)) {} else
> -
> -#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
> -	for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
> -		if (!sp_has_gptes(_sp)) {} else
> +		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
>  
>  static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  {
> @@ -7063,15 +7059,15 @@ static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
>  
>  	/*
>  	 * Since accounting information is stored in struct kvm_arch_memory_slot,
> -	 * shadow pages deletion (e.g. unaccount_shadowed()) requires that all
> -	 * gfns with a shadow page have a corresponding memslot.  Do so before
> -	 * the memslot goes away.
> +	 * all MMU pages that are shadowing guest PTEs must be zapped before the
> +	 * memslot is deleted, as freeing such pages after the memslot is freed
> +	 * will result in use-after-free, e.g. in unaccount_shadowed().
>  	 */
>  	for (i = 0; i < slot->npages; i++) {
>  		struct kvm_mmu_page *sp;
>  		gfn_t gfn = slot->base_gfn + i;
>  
> -		for_each_gfn_valid_sp(kvm, sp, gfn)
> +		for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
>  			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
>  
>  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> -- 
> 2.47.0.rc1.288.g06298d1525-goog
>
Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a9a23e058555..09494d01c38e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1884,14 +1884,10 @@  static bool sp_has_gptes(struct kvm_mmu_page *sp)
 		if (is_obsolete_sp((_kvm), (_sp))) {			\
 		} else
 
-#define for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
+#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
 	for_each_valid_sp(_kvm, _sp,					\
 	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
-		if ((_sp)->gfn != (_gfn)) {} else
-
-#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
-	for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
-		if (!sp_has_gptes(_sp)) {} else
+		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
 
 static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
@@ -7063,15 +7059,15 @@  static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
 
 	/*
 	 * Since accounting information is stored in struct kvm_arch_memory_slot,
-	 * shadow pages deletion (e.g. unaccount_shadowed()) requires that all
-	 * gfns with a shadow page have a corresponding memslot.  Do so before
-	 * the memslot goes away.
+	 * all MMU pages that are shadowing guest PTEs must be zapped before the
+	 * memslot is deleted, as freeing such pages after the memslot is freed
+	 * will result in use-after-free, e.g. in unaccount_shadowed().
 	 */
 	for (i = 0; i < slot->npages; i++) {
 		struct kvm_mmu_page *sp;
 		gfn_t gfn = slot->base_gfn + i;
 
-		for_each_gfn_valid_sp(kvm, sp, gfn)
+		for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {