[11/11] KVM: MMU: improve write flooding detection

Message ID 4E2EA5D2.8040804@cn.fujitsu.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong July 26, 2011, 11:32 a.m. UTC
Write-flooding detection does not work well today. When we handle a write to a
guest page table, we treat the page as write-flooded if the last speculatively
installed spte has not been accessed. However, speculative sptes are installed
on many paths (pte prefetch, page sync, ...), so the last speculative spte may
not point to the written page at all, and the written page may still be
accessed through other sptes; relying on the Accessed bit of the last
speculative spte is therefore not enough.

Instead of detecting whether the page was accessed, detect whether the spte
that maps the written gpa is accessed: if that spte is not accessed but the
page is written frequently, the page is probably no longer a page table, or
has not been used for a long time.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    6 +---
 arch/x86/kvm/mmu.c              |   53 +++++++++------------------------------
 arch/x86/kvm/paging_tmpl.h      |    9 +-----
 3 files changed, 16 insertions(+), 52 deletions(-)
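
In short, the patch replaces the per-vCPU "last written gfn" heuristic with a
per-shadow-page counter. Below is a minimal standalone C sketch of the two
schemes; the types are simplified stand-ins and ACCESSED stands in for
shadow_accessed_mask, so treat it as an illustration, not the kernel code:

#include <stdbool.h>
#include <stdint.h>

#define ACCESSED (1ULL << 5)    /* stand-in for shadow_accessed_mask */

/*
 * Old scheme: one counter per vCPU, keyed on the last written gfn and the
 * Accessed bit of the last *speculatively* installed spte.  Speculative
 * sptes are also created by pte prefetch and sync paths, so that spte may
 * not map the written page at all.
 */
struct old_vcpu_state {
        uint64_t last_pt_write_gfn;
        int      last_pt_write_count;
        uint64_t *last_pte_updated;
};

static bool old_detect(struct old_vcpu_state *v, uint64_t gfn)
{
        bool last_accessed = v->last_pte_updated &&
                             (*v->last_pte_updated & ACCESSED);

        if (gfn == v->last_pt_write_gfn && !last_accessed)
                return ++v->last_pt_write_count >= 3;

        v->last_pt_write_gfn = gfn;
        v->last_pt_write_count = 1;
        v->last_pte_updated = NULL;
        return false;
}

/*
 * New scheme: one counter per shadow page.  Only the spte that covers the
 * written gpa is inspected; a page that keeps being written through
 * emulation but whose spte is never accessed is probably not a live page
 * table, so it is worth zapping.
 */
struct new_sp_state {
        int write_flooding_count;
};

static bool new_detect(struct new_sp_state *sp, const uint64_t *spte)
{
        if (spte && !(*spte & ACCESSED))
                sp->write_flooding_count++;
        else
                sp->write_flooding_count = 0;

        return sp->write_flooding_count >= 3;
}

The per-shadow-page counter is also reset whenever the page is found again in
kvm_mmu_get_page (see the hunk in the patch below), so a page that is still in
active use as a page table does not accumulate writes toward the threshold.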

Comments

Avi Kivity July 27, 2011, 9:23 a.m. UTC | #1
On 07/26/2011 02:32 PM, Xiao Guangrong wrote:
> Write-flooding detection does not work well today. When we handle a write to a
> guest page table, we treat the page as write-flooded if the last speculatively
> installed spte has not been accessed. However, speculative sptes are installed
> on many paths (pte prefetch, page sync, ...), so the last speculative spte may
> not point to the written page at all, and the written page may still be
> accessed through other sptes; relying on the Accessed bit of the last
> speculative spte is therefore not enough.
>
> Instead of detecting whether the page was accessed, detect whether the spte
> that maps the written gpa is accessed: if that spte is not accessed but the
> page is written frequently, the page is probably no longer a page table, or
> has not been used for a long time.
>
>   static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_mmu_memory_cache *cache;
> @@ -3565,22 +3547,14 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
>    * If we're seeing too many writes to a page, it may no longer be a page table,
>    * or we may be forking, in which case it is better to unmap the page.
>    */
> -static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
> +static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
>   {
> -	bool flooded = false;
> -
> -	if (gfn == vcpu->arch.last_pt_write_gfn
> -	    && !last_updated_pte_accessed(vcpu)) {
> -		++vcpu->arch.last_pt_write_count;
> -		if (vcpu->arch.last_pt_write_count >= 3)
> -			flooded = true;
> -	} else {
> -		vcpu->arch.last_pt_write_gfn = gfn;
> -		vcpu->arch.last_pt_write_count = 1;
> -		vcpu->arch.last_pte_updated = NULL;
> -	}
> +	if (spte && !(*spte & shadow_accessed_mask))
> +		sp->write_flooding_count++;
> +	else
> +		sp->write_flooding_count = 0;
>
> -	return flooded;
> +	return sp->write_flooding_count >= 3;
>   }

I think this is a little dangerous.  A guest kernel may be instantiating 
multiple gptes on a page fault, but guest userspace hits only one of 
them (the one which caused the page fault) - I think Windows does this, 
but I'm not sure.

Maybe we should inspect parent_ptes instead?
Xiao Guangrong July 27, 2011, 10:20 a.m. UTC | #2
On 07/27/2011 05:23 PM, Avi Kivity wrote:
> On 07/26/2011 02:32 PM, Xiao Guangrong wrote:
>> Write-flooding detection does not work well today. When we handle a write to a
>> guest page table, we treat the page as write-flooded if the last speculatively
>> installed spte has not been accessed. However, speculative sptes are installed
>> on many paths (pte prefetch, page sync, ...), so the last speculative spte may
>> not point to the written page at all, and the written page may still be
>> accessed through other sptes; relying on the Accessed bit of the last
>> speculative spte is therefore not enough.
>>
>> Instead of detecting whether the page was accessed, detect whether the spte
>> that maps the written gpa is accessed: if that spte is not accessed but the
>> page is written frequently, the page is probably no longer a page table, or
>> has not been used for a long time.
>>
>>   static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
>>   {
>>       struct kvm_mmu_memory_cache *cache;
>> @@ -3565,22 +3547,14 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
>>    * If we're seeing too many writes to a page, it may no longer be a page table,
>>    * or we may be forking, in which case it is better to unmap the page.
>>    */
>> -static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
>> +static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
>>   {
>> -    bool flooded = false;
>> -
>> -    if (gfn == vcpu->arch.last_pt_write_gfn
>> -    && !last_updated_pte_accessed(vcpu)) {
>> -        ++vcpu->arch.last_pt_write_count;
>> -        if (vcpu->arch.last_pt_write_count >= 3)
>> -            flooded = true;
>> -    } else {
>> -        vcpu->arch.last_pt_write_gfn = gfn;
>> -        vcpu->arch.last_pt_write_count = 1;
>> -        vcpu->arch.last_pte_updated = NULL;
>> -    }
>> +    if (spte && !(*spte & shadow_accessed_mask))
>> +        sp->write_flooding_count++;
>> +    else
>> +        sp->write_flooding_count = 0;
>>
>> -    return flooded;
>> +    return sp->write_flooding_count >= 3;
>>   }
> 
> I think this is a little dangerous.  A guest kernel may be instantiating multiple gptes on a page fault, but guest userspace hits only one of them (the one which caused the page fault) - I think Windows does this, but I'm not sure.
> 

I think this case is not bad: if the guest kernel needs to write multiple gptes (>= 3),
that already causes many page faults, so we are better off zapping the shadow page and
letting the page become writable as soon as possible.
(And we have pte prefetch, which can quickly re-establish the mappings for a new shadow page.)

Avi Kivity July 27, 2011, 11:08 a.m. UTC | #3
On 07/27/2011 01:20 PM, Xiao Guangrong wrote:
> >>    }
> >
> >  I think this is a little dangerous.  A guest kernel may be instantiating multiple gptes on a page fault, but guest userspace hits only one of them (the one which caused the page fault) - I think Windows does this, but I'm not sure.
> >
>
> I think this case is not bad: if the guest kernel needs to write multiple gptes (>= 3),
> that already causes many page faults, so we are better off zapping the shadow page and
> letting the page become writable as soon as possible.
> (And we have pte prefetch, which can quickly re-establish the mappings for a new shadow page.)

Actually, what should save us is unsync pages.  Why are we hitting this 
path at all?
Xiao Guangrong July 28, 2011, 2:43 a.m. UTC | #4
On 07/27/2011 07:08 PM, Avi Kivity wrote:
> On 07/27/2011 01:20 PM, Xiao Guangrong wrote:
>> >>    }
>> >
>> >  I think this is a little dangerous.  A guest kernel may be instantiating multiple gptes on a page fault, but guest userspace hits only one of them (the one which caused the page fault) - I think Windows does this, but I'm not sure.
>> >
>>
>> I think this case is not bad: if the guest kernel needs to write multiple gptes (>= 3),
>> that already causes many page faults, so we are better off zapping the shadow page and
>> letting the page become writable as soon as possible.
>> (And we have pte prefetch, which can quickly re-establish the mappings for a new shadow page.)
> 
> Actually, what should save us is unsync pages.  Why are we hitting this path at all?
> 

Avi,

The shadow page can not become unsync if there is another sp with sp.gfn == gfn && sp.role.level != 1,
for example:
- the gfn is not only used as a last-level page structure (PTE page)
or
- the gfn was used as an upper-level page structure before and we have not zapped the old shadow pages

So, when this gfn is written a #PF is generated and the write is emulated; we would like those sps
to be zapped as early as possible, so that a later #PF finds no shadow pages for the gfn and the
mapping can become writable.
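
Put differently, a gfn's last-level shadow pages can only go unsync when no other
valid indirect shadow page maps that gfn at a higher level. A simplified standalone
sketch of that rule (the struct sp here is a hypothetical stand-in for kvm_mmu_page,
not the actual unsync code):

#include <stdbool.h>
#include <stddef.h>

#define PT_PAGE_TABLE_LEVEL 1   /* last-level (PTE) shadow page */

struct sp {                     /* stand-in for struct kvm_mmu_page */
        unsigned long gfn;
        int level;
};

/*
 * A write to @gfn can be absorbed by marking its shadow pages unsync only
 * if every valid indirect shadow page for that gfn is a last-level page.
 * If the gfn is (or used to be) shadowed at a higher level as well, it
 * stays write-protected and every guest write is emulated.
 */
static bool can_go_unsync(const struct sp *pages, size_t n, unsigned long gfn)
{
        for (size_t i = 0; i < n; i++) {
                if (pages[i].gfn != gfn)
                        continue;
                if (pages[i].level != PT_PAGE_TABLE_LEVEL)
                        return false;
        }
        return true;
}

When the rule above fails, every guest write to the gfn traps and is emulated,
which is exactly the situation where the write-flooding detector eventually gets
those shadow pages zapped.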

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ce17642..8c938db 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@  struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
+	int write_flooding_count;
+
 	struct rcu_head rcu;
 };
 
@@ -352,10 +354,6 @@  struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-	gfn_t last_pt_write_gfn;
-	int   last_pt_write_count;
-	u64  *last_pte_updated;
-
 	struct fpu guest_fpu;
 	u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bb55b15..8c2885c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1688,6 +1688,7 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		} else if (sp->unsync)
 			kvm_mmu_mark_parents_unsync(sp);
 
+		sp->write_flooding_count = 0;
 		trace_kvm_mmu_get_page(sp, false);
 		return sp;
 	}
@@ -1840,15 +1841,6 @@  static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 	mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-	int i;
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 *parent_pte;
@@ -1908,7 +1900,6 @@  static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	}
 
 	sp->role.invalid = 1;
-	kvm_mmu_reset_last_pte_updated(kvm);
 	return ret;
 }
 
@@ -2350,8 +2341,6 @@  static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 	kvm_release_pfn_clean(pfn);
-	if (speculative)
-		vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3509,13 +3498,6 @@  static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
 		kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_memory_cache *cache;
@@ -3565,22 +3547,14 @@  static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-	bool flooded = false;
-
-	if (gfn == vcpu->arch.last_pt_write_gfn
-	    && !last_updated_pte_accessed(vcpu)) {
-		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = true;
-	} else {
-		vcpu->arch.last_pt_write_gfn = gfn;
-		vcpu->arch.last_pt_write_count = 1;
-		vcpu->arch.last_pte_updated = NULL;
-	}
+	if (spte && !(*spte & shadow_accessed_mask))
+		sp->write_flooding_count++;
+	else
+		sp->write_flooding_count = 0;
 
-	return flooded;
+	return sp->write_flooding_count >= 3;
 }
 
 /*
@@ -3663,7 +3637,7 @@  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	LIST_HEAD(invalid_list);
 	u64 entry, gentry, *spte;
 	int npte;
-	bool remote_flush, local_flush, zap_page, flooded;
+	bool remote_flush, local_flush, zap_page;
 
 	/*
 	 * If we don't have indirect shadow pages, it means no page is
@@ -3682,21 +3656,18 @@  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
-	flooded = detect_write_flooding(vcpu, gfn);
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
-		bool mismatch, misaligned;
-
-		misaligned = detect_write_misaligned(sp, gpa, bytes);
-		mismatch = detect_mismatch_sp(vcpu, sp);
+		spte = get_written_sptes(sp, gpa, &npte);
 
-		if (misaligned || mismatch || flooded || repeat_write) {
+		if (repeat_write || detect_write_misaligned(sp, gpa, bytes) ||
+		      detect_write_flooding(sp, spte) ||
+		      detect_mismatch_sp(vcpu, sp)) {
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 						     &invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
 			continue;
 		}
 
-		spte = get_written_sptes(sp, gpa, &npte);
 		if (!spte)
 			continue;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3466229..82063b2 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -595,11 +595,9 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	 */
 	if (!r) {
 		pgprintk("%s: guest page fault\n", __func__);
-		if (!prefault) {
+		if (!prefault)
 			inject_page_fault(vcpu, &walker.fault);
-			/* reset fork detector */
-			vcpu->arch.last_pt_write_count = 0;
-		}
+
 		return 0;
 	}
 
@@ -637,9 +635,6 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	pgprintk("%s: shadow pte %p %llx emulate %d\n", __func__,
 		 sptep, *sptep, emulate);
 
-	if (!emulate)
-		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
-
 	++vcpu->stat.pf_fixed;
 	trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 	spin_unlock(&vcpu->kvm->mmu_lock);