diff mbox

[11/11] KVM: MMU: improve write flooding detected

Message ID 4E4A1257.5080204@cn.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

Xiao Guangrong Aug. 16, 2011, 6:46 a.m. UTC
Detecting write-flooding does not work well, when we handle page written, if
the last speculative spte is not accessed, we treat the page is
write-flooding, however, we can speculative spte on many path, such as pte
prefetch, page synced, that means the last speculative spte may be not point
to the written page and the written page can be accessed via other sptes, so
depends on the Accessed bit of the last speculative spte is not enough

Instead of detected page accessed, we can detect whether the spte is accessed
or not, if the spte is not accessed but it is written frequently, we treat is
not a page table or it not used for a long time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    6 +---
 arch/x86/kvm/mmu.c              |   48 +++++++++------------------------------
 arch/x86/kvm/paging_tmpl.h      |    9 +-----
 3 files changed, 15 insertions(+), 48 deletions(-)

Comments

Marcelo Tosatti Aug. 23, 2011, 8 a.m. UTC | #1
On Tue, Aug 16, 2011 at 02:46:47PM +0800, Xiao Guangrong wrote:
> Detecting write-flooding does not work well, when we handle page written, if
> the last speculative spte is not accessed, we treat the page is
> write-flooding, however, we can speculative spte on many path, such as pte
> prefetch, page synced, that means the last speculative spte may be not point
> to the written page and the written page can be accessed via other sptes, so
> depends on the Accessed bit of the last speculative spte is not enough

Yes, a stale last_speculative_spte is possible, but is this fact a
noticeable problem in practice?

Was this detected by code inspection?

> Instead of detected page accessed, we can detect whether the spte is accessed
> or not, if the spte is not accessed but it is written frequently, we treat is
> not a page table or it not used for a long time
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    6 +---
>  arch/x86/kvm/mmu.c              |   48 +++++++++------------------------------
>  arch/x86/kvm/paging_tmpl.h      |    9 +-----
>  3 files changed, 15 insertions(+), 48 deletions(-)
>
 
> -static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
> +static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
>  {
> -	bool flooded = false;
> -
> -	if (gfn == vcpu->arch.last_pt_write_gfn
> -	    && !last_updated_pte_accessed(vcpu)) {
> -		++vcpu->arch.last_pt_write_count;
> -		if (vcpu->arch.last_pt_write_count >= 3)
> -			flooded = true;
> -	} else {
> -		vcpu->arch.last_pt_write_gfn = gfn;
> -		vcpu->arch.last_pt_write_count = 1;
> -		vcpu->arch.last_pte_updated = NULL;
> -	}
> +	if (spte && !(*spte & shadow_accessed_mask))
> +		sp->write_flooding_count++;
> +	else
> +		sp->write_flooding_count = 0;

This relies on the sptes being created by speculative means
or by pressure on the host clearing the accessed bit for the
shadow page to be zapped. 

There is no guarantee that either of these is true for a given
spte.

And if the sptes do not have accessed bit set, any nonconsecutive 3 pte
updates will zap the page.

Back to the first question, what is the motivation for this heuristic
change? Do you have any numbers?

If its a significant problem, perhaps getting rid of the
'last_spte_accessed' part is enough.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong Aug. 23, 2011, 10:55 a.m. UTC | #2
Hi Marcelo,

On 08/23/2011 04:00 PM, Marcelo Tosatti wrote:
> On Tue, Aug 16, 2011 at 02:46:47PM +0800, Xiao Guangrong wrote:
>> Detecting write-flooding does not work well, when we handle page written, if
>> the last speculative spte is not accessed, we treat the page is
>> write-flooding, however, we can speculative spte on many path, such as pte
>> prefetch, page synced, that means the last speculative spte may be not point
>> to the written page and the written page can be accessed via other sptes, so
>> depends on the Accessed bit of the last speculative spte is not enough
> 
> Yes, a stale last_speculative_spte is possible, but is this fact a
> noticeable problem in practice?
> 
> Was this detected by code inspection?
> 

I detected this because: i noticed some shadow page is zapped by
write-flooding but it is accessed soon, it causes the shadow page zapped
and alloced again and again(very frequently).

Another reason is that: in current code, write-flooding is little complex
and it stuffs code in many places, actually, write-flooding is only needed for
shadow page/nested guest, so i want to simplify it and wrap its code up.

>> -	}
>> +	if (spte && !(*spte & shadow_accessed_mask))
>> +		sp->write_flooding_count++;
>> +	else
>> +		sp->write_flooding_count = 0;
> 
> This relies on the sptes being created by speculative means
> or by pressure on the host clearing the accessed bit for the
> shadow page to be zapped. 
> 
> There is no guarantee that either of these is true for a given
> spte.
> 
> And if the sptes do not have accessed bit set, any nonconsecutive 3 pte
> updates will zap the page.
> 

Please note we clear 'sp->write_flooding_count' when it is accessed from
shadow page cache (in kvm_mmu_get_page), it means if any spte of sp generates
#PF, the fooding count can be reset.

And, i think there are not problems since: if the spte without accssed bit is
written frequently, it means the guest page table is accessed infrequently or
during the writing, the guest page table is not accessed, in this time, zapping
this shadow page is not bad.

Comparing the old way, the advantage of it is good for zapping upper shadow page,
for example, in the old way:
if a gfn is used as PDE for a task, later, the gfn is freed and used as PTE for
the new task, so we have two shadow pages in the host, one sp1.level = 2 and the
other sp2.level = 1. So, when we detect write-flooding, the vcpu->last_pte_updated
always point to sp2.pte. As sp2 is used for the new task, we always detected both
shadow pages are bing used, but actually, sp1 is not used by guest anymore.

> Back to the first question, what is the motivation for this heuristic
> change? Do you have any numbers?
> 

Yes, i have done the quick test:

before this patch:
2m56.561
2m50.651
2m51.220
2m52.199
2m48.066

After this patch:
2m51.194
2m55.980
2m50.755
2m47.396
2m46.807

It shows the new way is little better than the old way.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Aug. 23, 2011, 12:38 p.m. UTC | #3
On Tue, Aug 23, 2011 at 06:55:39PM +0800, Xiao Guangrong wrote:
> Hi Marcelo,
> 
> On 08/23/2011 04:00 PM, Marcelo Tosatti wrote:
> > On Tue, Aug 16, 2011 at 02:46:47PM +0800, Xiao Guangrong wrote:
> >> Detecting write-flooding does not work well, when we handle page written, if
> >> the last speculative spte is not accessed, we treat the page is
> >> write-flooding, however, we can speculative spte on many path, such as pte
> >> prefetch, page synced, that means the last speculative spte may be not point
> >> to the written page and the written page can be accessed via other sptes, so
> >> depends on the Accessed bit of the last speculative spte is not enough
> > 
> > Yes, a stale last_speculative_spte is possible, but is this fact a
> > noticeable problem in practice?
> > 
> > Was this detected by code inspection?
> > 
> 
> I detected this because: i noticed some shadow page is zapped by
> write-flooding but it is accessed soon, it causes the shadow page zapped
> and alloced again and again(very frequently).
> 
> Another reason is that: in current code, write-flooding is little complex
> and it stuffs code in many places, actually, write-flooding is only needed for
> shadow page/nested guest, so i want to simplify it and wrap its code up.
> 
> >> -	}
> >> +	if (spte && !(*spte & shadow_accessed_mask))
> >> +		sp->write_flooding_count++;
> >> +	else
> >> +		sp->write_flooding_count = 0;
> > 
> > This relies on the sptes being created by speculative means
> > or by pressure on the host clearing the accessed bit for the
> > shadow page to be zapped. 
> > 
> > There is no guarantee that either of these is true for a given
> > spte.
> > 
> > And if the sptes do not have accessed bit set, any nonconsecutive 3 pte
> > updates will zap the page.
> > 
> 
> Please note we clear 'sp->write_flooding_count' when it is accessed from
> shadow page cache (in kvm_mmu_get_page), it means if any spte of sp generates
> #PF, the fooding count can be reset.

OK.

> And, i think there are not problems since: if the spte without accssed bit is
> written frequently, it means the guest page table is accessed infrequently or
> during the writing, the guest page table is not accessed, in this time, zapping
> this shadow page is not bad.

Think of the following scenario:

1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
is not increased.
3) repeat

So you cannot rely on the accessed bit being cleared to zap the shadow
page, because it might not be cleared in certain scenarios.

> Comparing the old way, the advantage of it is good for zapping upper shadow page,
> for example, in the old way:
> if a gfn is used as PDE for a task, later, the gfn is freed and used as PTE for
> the new task, so we have two shadow pages in the host, one sp1.level = 2 and the
> other sp2.level = 1. So, when we detect write-flooding, the vcpu->last_pte_updated
> always point to sp2.pte. As sp2 is used for the new task, we always detected both
> shadow pages are bing used, but actually, sp1 is not used by guest anymore.

Makes sense.

> > Back to the first question, what is the motivation for this heuristic
> > change? Do you have any numbers?
> > 
> 
> Yes, i have done the quick test:
> 
> before this patch:
> 2m56.561
> 2m50.651
> 2m51.220
> 2m52.199
> 2m48.066
> 
> After this patch:
> 2m51.194
> 2m55.980
> 2m50.755
> 2m47.396
> 2m46.807
> 
> It shows the new way is little better than the old way.

What test is this?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong Aug. 23, 2011, 4:32 p.m. UTC | #4
On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:

>> And, i think there are not problems since: if the spte without accssed bit is
>> written frequently, it means the guest page table is accessed infrequently or
>> during the writing, the guest page table is not accessed, in this time, zapping
>> this shadow page is not bad.
> 
> Think of the following scenario:
> 
> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> is not increased.
> 3) repeat
> 

I think the result is just we hoped, we do not want to zap the shadow page
because the spte is currently used by the guest, it also will be used in the
next repetition. So do not increase 'write_flooding_count' is a good choice.

Let's consider what will happen if we increase 'write_flooding_count':
1: after three repetitions, zap the shadow page
2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
3: in step 2, the flooding count is creased, so after 3 repetitions, the
   shadow page can be zapped again, repeat 1 to 3.

The result is the shadow page for gfnA is alloced and zapped again and again,
yes?

> So you cannot rely on the accessed bit being cleared to zap the shadow
> page, because it might not be cleared in certain scenarios.
> 
>> Comparing the old way, the advantage of it is good for zapping upper shadow page,
>> for example, in the old way:
>> if a gfn is used as PDE for a task, later, the gfn is freed and used as PTE for
>> the new task, so we have two shadow pages in the host, one sp1.level = 2 and the
>> other sp2.level = 1. So, when we detect write-flooding, the vcpu->last_pte_updated
>> always point to sp2.pte. As sp2 is used for the new task, we always detected both
>> shadow pages are bing used, but actually, sp1 is not used by guest anymore.
> 
> Makes sense.
> 
>>> Back to the first question, what is the motivation for this heuristic
>>> change? Do you have any numbers?
>>>
>>
>> Yes, i have done the quick test:
>>
>> before this patch:
>> 2m56.561
>> 2m50.651
>> 2m51.220
>> 2m52.199
>> 2m48.066
>>
>> After this patch:
>> 2m51.194
>> 2m55.980
>> 2m50.755
>> 2m47.396
>> 2m46.807
>>
>> It shows the new way is little better than the old way.
> 
> What test is this?
> 

Sorry, i forgot to mention it, the test case is kerbench. :-)
 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Aug. 23, 2011, 7:09 p.m. UTC | #5
On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> 
> >> And, i think there are not problems since: if the spte without accssed bit is
> >> written frequently, it means the guest page table is accessed infrequently or
> >> during the writing, the guest page table is not accessed, in this time, zapping
> >> this shadow page is not bad.
> > 
> > Think of the following scenario:
> > 
> > 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> > 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> > is not increased.
> > 3) repeat
> > 
> 
> I think the result is just we hoped, we do not want to zap the shadow page
> because the spte is currently used by the guest, it also will be used in the
> next repetition. So do not increase 'write_flooding_count' is a good choice.

Its not used. Step 2) is write to write protected shadow page at
gfnA.

> Let's consider what will happen if we increase 'write_flooding_count':
> 1: after three repetitions, zap the shadow page
> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
>    shadow page can be zapped again, repeat 1 to 3.

The shadow page will not be zapped because the spte created from
gfnA+indexA has the accessed bit set:

       if (spte && !(*spte & shadow_accessed_mask))
               sp->write_flooding_count++;
       else
               sp->write_flooding_count = 0;

> The result is the shadow page for gfnA is alloced and zapped again and again,
> yes?

The point is you cannot rely on the accessed bit of sptes that have been
instantiated with the accessed bit set to decide whether or not to zap.
Because the accessed bit will only be cleared on host memory pressure.

> > So you cannot rely on the accessed bit being cleared to zap the shadow
> > page, because it might not be cleared in certain scenarios.
> > 
> >> Comparing the old way, the advantage of it is good for zapping upper shadow page,
> >> for example, in the old way:
> >> if a gfn is used as PDE for a task, later, the gfn is freed and used as PTE for
> >> the new task, so we have two shadow pages in the host, one sp1.level = 2 and the
> >> other sp2.level = 1. So, when we detect write-flooding, the vcpu->last_pte_updated
> >> always point to sp2.pte. As sp2 is used for the new task, we always detected both
> >> shadow pages are bing used, but actually, sp1 is not used by guest anymore.
> > 
> > Makes sense.
> > 
> >>> Back to the first question, what is the motivation for this heuristic
> >>> change? Do you have any numbers?
> >>>
> >>
> >> Yes, i have done the quick test:
> >>
> >> before this patch:
> >> 2m56.561
> >> 2m50.651
> >> 2m51.220
> >> 2m52.199
> >> 2m48.066
> >>
> >> After this patch:
> >> 2m51.194
> >> 2m55.980
> >> 2m50.755
> >> 2m47.396
> >> 2m46.807
> >>
> >> It shows the new way is little better than the old way.
> > 
> > What test is this?
> > 
> 
> Sorry, i forgot to mention it, the test case is kerbench. :-)
>  
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong Aug. 23, 2011, 8:16 p.m. UTC | #6
On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
>> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
>>
>>>> And, i think there are not problems since: if the spte without accssed bit is
>>>> written frequently, it means the guest page table is accessed infrequently or
>>>> during the writing, the guest page table is not accessed, in this time, zapping
>>>> this shadow page is not bad.
>>>
>>> Think of the following scenario:
>>>
>>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
>>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
>>> is not increased.
>>> 3) repeat
>>>
>>
>> I think the result is just we hoped, we do not want to zap the shadow page
>> because the spte is currently used by the guest, it also will be used in the
>> next repetition. So do not increase 'write_flooding_count' is a good choice.
> 
> Its not used. Step 2) is write to write protected shadow page at
> gfnA.
> 
>> Let's consider what will happen if we increase 'write_flooding_count':
>> 1: after three repetitions, zap the shadow page
>> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
>> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
>>    shadow page can be zapped again, repeat 1 to 3.
> 
> The shadow page will not be zapped because the spte created from
> gfnA+indexA has the accessed bit set:
> 
>        if (spte && !(*spte & shadow_accessed_mask))
>                sp->write_flooding_count++;
>        else
>                sp->write_flooding_count = 0;
> 

Ah, i see, i thought it was "repeat"ed on the same spte, it was my wrong.

Yes, in this case, the sp is not zapped, but it is hardly to know the gfn
is not used as gpte just depends on writing, for example, the guest can
change the mapping address or the status bit, and so on...The sp can be
zapped if the guest write it again(on the same address), i think it is
acceptable, anymore, it is just the speculative way to zap the unused
shadow page...your opinion?

>> The result is the shadow page for gfnA is alloced and zapped again and again,
>> yes?
> 
> The point is you cannot rely on the accessed bit of sptes that have been
> instantiated with the accessed bit set to decide whether or not to zap.
> Because the accessed bit will only be cleared on host memory pressure.
> 

Yes, accessed bit is the cursory way to track gpte accessed, however,
at least, the accessed bit can indicate whether the gfn is accessed
for a period of time in the most case, for example, from it is
speculated to it is written, or from it is zapped to it is written,
i thinks it is not too bad.

Do you have ideas to improve this?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Aug. 24, 2011, 8:05 p.m. UTC | #7
On Wed, Aug 24, 2011 at 04:16:52AM +0800, Xiao Guangrong wrote:
> On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> >>
> >>>> And, i think there are not problems since: if the spte without accssed bit is
> >>>> written frequently, it means the guest page table is accessed infrequently or
> >>>> during the writing, the guest page table is not accessed, in this time, zapping
> >>>> this shadow page is not bad.
> >>>
> >>> Think of the following scenario:
> >>>
> >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> >>> is not increased.
> >>> 3) repeat
> >>>
> >>
> >> I think the result is just we hoped, we do not want to zap the shadow page
> >> because the spte is currently used by the guest, it also will be used in the
> >> next repetition. So do not increase 'write_flooding_count' is a good choice.
> > 
> > Its not used. Step 2) is write to write protected shadow page at
> > gfnA.
> > 
> >> Let's consider what will happen if we increase 'write_flooding_count':
> >> 1: after three repetitions, zap the shadow page
> >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> >> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
> >>    shadow page can be zapped again, repeat 1 to 3.
> > 
> > The shadow page will not be zapped because the spte created from
> > gfnA+indexA has the accessed bit set:
> > 
> >        if (spte && !(*spte & shadow_accessed_mask))
> >                sp->write_flooding_count++;
> >        else
> >                sp->write_flooding_count = 0;
> > 
> 
> Ah, i see, i thought it was "repeat"ed on the same spte, it was my wrong.
> 
> Yes, in this case, the sp is not zapped, but it is hardly to know the gfn
> is not used as gpte just depends on writing, for example, the guest can
> change the mapping address or the status bit, and so on...The sp can be
> zapped if the guest write it again(on the same address), i think it is
> acceptable, anymore, it is just the speculative way to zap the unused
> shadow page...your opinion?

It could increase the flood count independently of the accessed bit of
the spte being updated, zapping after 3 attempts as it is now.

But additionally reset the flood count if the gpte appears to be valid
(points to an existant gfn if the present bit is set, or if its zeroed).

> >> The result is the shadow page for gfnA is alloced and zapped again and again,
> >> yes?
> > 
> > The point is you cannot rely on the accessed bit of sptes that have been
> > instantiated with the accessed bit set to decide whether or not to zap.
> > Because the accessed bit will only be cleared on host memory pressure.
> > 
> 
> Yes, accessed bit is the cursory way to track gpte accessed, however,
> at least, the accessed bit can indicate whether the gfn is accessed
> for a period of time in the most case, for example, from it is
> speculated to it is written, or from it is zapped to it is written,
> i thinks it is not too bad.
> 
> Do you have ideas to improve this?


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Aug. 25, 2011, 2:04 a.m. UTC | #8
On Wed, Aug 24, 2011 at 05:05:40PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 04:16:52AM +0800, Xiao Guangrong wrote:
> > On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> > >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> > >>
> > >>>> And, i think there are not problems since: if the spte without accssed bit is
> > >>>> written frequently, it means the guest page table is accessed infrequently or
> > >>>> during the writing, the guest page table is not accessed, in this time, zapping
> > >>>> this shadow page is not bad.
> > >>>
> > >>> Think of the following scenario:
> > >>>
> > >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> > >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> > >>> is not increased.
> > >>> 3) repeat
> > >>>
> > >>
> > >> I think the result is just we hoped, we do not want to zap the shadow page
> > >> because the spte is currently used by the guest, it also will be used in the
> > >> next repetition. So do not increase 'write_flooding_count' is a good choice.
> > > 
> > > Its not used. Step 2) is write to write protected shadow page at
> > > gfnA.
> > > 
> > >> Let's consider what will happen if we increase 'write_flooding_count':
> > >> 1: after three repetitions, zap the shadow page
> > >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> > >> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
> > >>    shadow page can be zapped again, repeat 1 to 3.
> > > 
> > > The shadow page will not be zapped because the spte created from
> > > gfnA+indexA has the accessed bit set:
> > > 
> > >        if (spte && !(*spte & shadow_accessed_mask))
> > >                sp->write_flooding_count++;
> > >        else
> > >                sp->write_flooding_count = 0;
> > > 
> > 
> > Ah, i see, i thought it was "repeat"ed on the same spte, it was my wrong.
> > 
> > Yes, in this case, the sp is not zapped, but it is hardly to know the gfn
> > is not used as gpte just depends on writing, for example, the guest can
> > change the mapping address or the status bit, and so on...The sp can be
> > zapped if the guest write it again(on the same address), i think it is
> > acceptable, anymore, it is just the speculative way to zap the unused
> > shadow page...your opinion?
> 
> It could increase the flood count independently of the accessed bit of
> the spte being updated, zapping after 3 attempts as it is now.
> 
> But additionally reset the flood count if the gpte appears to be valid
> (points to an existant gfn if the present bit is set, or if its zeroed).

Well not zero, as thats a common pattern for non ptes.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity Aug. 25, 2011, 4:42 a.m. UTC | #9
On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >
> >  It could increase the flood count independently of the accessed bit of
> >  the spte being updated, zapping after 3 attempts as it is now.
> >
> >  But additionally reset the flood count if the gpte appears to be valid
> >  (points to an existant gfn if the present bit is set, or if its zeroed).
>
> Well not zero, as thats a common pattern for non ptes.
>

On 32-bit with 4GB RAM, practically anything is a valid gpte.
Xiao Guangrong Aug. 25, 2011, 7:40 a.m. UTC | #10
On 08/25/2011 10:04 AM, Marcelo Tosatti wrote:

>>> Yes, in this case, the sp is not zapped, but it is hardly to know the gfn
>>> is not used as gpte just depends on writing, for example, the guest can
>>> change the mapping address or the status bit, and so on...The sp can be
>>> zapped if the guest write it again(on the same address), i think it is
>>> acceptable, anymore, it is just the speculative way to zap the unused
>>> shadow page...your opinion?
>>
>> It could increase the flood count independently of the accessed bit of
>> the spte being updated, zapping after 3 attempts as it is now.
>>
>> But additionally reset the flood count if the gpte appears to be valid
>> (points to an existant gfn if the present bit is set, or if its zeroed).
> 
> Well not zero, as thats a common pattern for non ptes.
> 

Hi Marcelo,

Maybe it is not good i think, for some reasons:
- checking gfn valid which it is pointed by gpte is high overload,
  it needs to call gfn_to_hva to walk memslots, especially. kvm_mmu_pte_write
  is called very frequently on shadow mmu.

- MMIO gfn is not an existent gfn, but it is valid pointed by gpte

- we can check the reserved bits in the gpte to check whether it is valid a
  gpte, but for some paging modes, all bits are valid.(for example, non-PAE mode)

- it can not work if the gfn has multiple shadow pages, for example:
  if the gfn was used as PDE, later it is used as PTE, then we have two shadow
  pages: sp1.level = 2, sp2.level = 1, sp1 can not be zapped even even though it
  is not used anymore.

- sometime, we need to zap the shadow page even though the gpte is written validly:
  if the gpte is written frequently but infrequently accessed, we do better zap the
  shadow page to let it is writable(write it directly without #PF) and map it when it
  is accessed, one example is from Avi, the guest OS may update many gptes at one time
  after one page fault.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Xiao Guangrong Aug. 25, 2011, 7:57 a.m. UTC | #11
On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
>> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
>>
>>>> And, i think there are not problems since: if the spte without accssed bit is
>>>> written frequently, it means the guest page table is accessed infrequently or
>>>> during the writing, the guest page table is not accessed, in this time, zapping
>>>> this shadow page is not bad.
>>>
>>> Think of the following scenario:
>>>
>>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
>>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
>>> is not increased.
>>> 3) repeat
>>>
>>
>> I think the result is just we hoped, we do not want to zap the shadow page
>> because the spte is currently used by the guest, it also will be used in the
>> next repetition. So do not increase 'write_flooding_count' is a good choice.
> 
> Its not used. Step 2) is write to write protected shadow page at
> gfnA.
> 
>> Let's consider what will happen if we increase 'write_flooding_count':
>> 1: after three repetitions, zap the shadow page
>> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
>> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
>>    shadow page can be zapped again, repeat 1 to 3.
> 
> The shadow page will not be zapped because the spte created from
> gfnA+indexA has the accessed bit set:
> 
>        if (spte && !(*spte & shadow_accessed_mask))
>                sp->write_flooding_count++;
>        else
>                sp->write_flooding_count = 0;
> 

Marcelo, i am still confused with your example, in step 3), what is repeated?
it repeats step 2) or it repeats step 1) and 2)?

Only step 2) is repeated i guess, right? if it is yes, it works well:
when the guest writes gpte, the spte of corresponding shadow page is zapped
(level > 1) or it is speculatively fetched(level == 1), the accessed bit is
cleared in both case.

the later write can detect that the accessed bit is not set, and write_flooding_count
is increased. finally, the shadow page is zapped, the gpte is written directly.

>> The result is the shadow page for gfnA is alloced and zapped again and again,
>> yes?
> 
> The point is you cannot rely on the accessed bit of sptes that have been
> instantiated with the accessed bit set to decide whether or not to zap.
> Because the accessed bit will only be cleared on host memory pressure.
> 

But the accessed bit is also cleared after spte is written.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Aug. 25, 2011, 1:21 p.m. UTC | #12
On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
> On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >>
> >>  It could increase the flood count independently of the accessed bit of
> >>  the spte being updated, zapping after 3 attempts as it is now.
> >>
> >>  But additionally reset the flood count if the gpte appears to be valid
> >>  (points to an existant gfn if the present bit is set, or if its zeroed).
> >
> >Well not zero, as thats a common pattern for non ptes.
> >
> 
> On 32-bit with 4GB RAM, practically anything is a valid gpte.

The following could be required to consider a valid gpte, for write
flood detection purposes:

- Must be present.
- PageCacheDisable must be unset.
- PageWriteThrough must be unset.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti Aug. 25, 2011, 1:47 p.m. UTC | #13
On Thu, Aug 25, 2011 at 03:57:22PM +0800, Xiao Guangrong wrote:
> On 08/24/2011 03:09 AM, Marcelo Tosatti wrote:
> > On Wed, Aug 24, 2011 at 12:32:32AM +0800, Xiao Guangrong wrote:
> >> On 08/23/2011 08:38 PM, Marcelo Tosatti wrote:
> >>
> >>>> And, i think there are not problems since: if the spte without accssed bit is
> >>>> written frequently, it means the guest page table is accessed infrequently or
> >>>> during the writing, the guest page table is not accessed, in this time, zapping
> >>>> this shadow page is not bad.
> >>>
> >>> Think of the following scenario:
> >>>
> >>> 1) page fault, spte with accessed bit is created from gpte at gfnA+indexA.
> >>> 2) write to gfnA+indexA, spte has accessed bit set, write_flooding_count
> >>> is not increased.
> >>> 3) repeat
> >>>
> >>
> >> I think the result is just we hoped, we do not want to zap the shadow page
> >> because the spte is currently used by the guest, it also will be used in the
> >> next repetition. So do not increase 'write_flooding_count' is a good choice.
> > 
> > Its not used. Step 2) is write to write protected shadow page at
> > gfnA.
> > 
> >> Let's consider what will happen if we increase 'write_flooding_count':
> >> 1: after three repetitions, zap the shadow page
> >> 2: in step 1, we will alloc a new shadow page for gpte at gfnA+indexA
> >> 3: in step 2, the flooding count is creased, so after 3 repetitions, the
> >>    shadow page can be zapped again, repeat 1 to 3.
> > 
> > The shadow page will not be zapped because the spte created from
> > gfnA+indexA has the accessed bit set:
> > 
> >        if (spte && !(*spte & shadow_accessed_mask))
> >                sp->write_flooding_count++;
> >        else
> >                sp->write_flooding_count = 0;
> > 
> 
> Marcelo, i am still confused with your example, in step 3), what is repeated?
> it repeats step 2) or it repeats step 1) and 2)?
> 
> Only step 2) is repeated i guess, right? if it is yes, it works well:
> when the guest writes gpte, the spte of corresponding shadow page is zapped
> (level > 1) or it is speculatively fetched(level == 1), the accessed bit is
> cleared in both case.

Right.

> the later write can detect that the accessed bit is not set, and write_flooding_count
> is increased. finally, the shadow page is zapped, the gpte is written directly.
> 
> >> The result is the shadow page for gfnA is alloced and zapped again and again,
> >> yes?
> > 
> > The point is you cannot rely on the accessed bit of sptes that have been
> > instantiated with the accessed bit set to decide whether or not to zap.
> > Because the accessed bit will only be cleared on host memory pressure.
> > 
> 
> But the accessed bit is also cleared after spte is written.

Right. But only one of the 512 sptes. Worst case, a shadow that has 1
spte with accessed bit at every 3 spte entries would not be zapped for a
linear write of the entire guest pagetable. The current heuristic does 
not suffer from this issue.

I guess it is OK to be more trigger happy with zapping by ignoring
the accessed bit, clearing the flood counter on page fault.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Avi Kivity Aug. 25, 2011, 2:06 p.m. UTC | #14
On 08/25/2011 04:21 PM, Marcelo Tosatti wrote:
> On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
> >  On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
> >  >>
> >  >>   It could increase the flood count independently of the accessed bit of
> >  >>   the spte being updated, zapping after 3 attempts as it is now.
> >  >>
> >  >>   But additionally reset the flood count if the gpte appears to be valid
> >  >>   (points to an existant gfn if the present bit is set, or if its zeroed).
> >  >
> >  >Well not zero, as thats a common pattern for non ptes.
> >  >
> >
> >  On 32-bit with 4GB RAM, practically anything is a valid gpte.
>
> The following could be required to consider a valid gpte, for write
> flood detection purposes:
>
> - Must be present.
> - PageCacheDisable must be unset.
> - PageWriteThrough must be unset.
>

Unless the guest is using PAT.
Avi Kivity Aug. 25, 2011, 2:07 p.m. UTC | #15
On 08/25/2011 05:06 PM, Avi Kivity wrote:
> On 08/25/2011 04:21 PM, Marcelo Tosatti wrote:
>> On Thu, Aug 25, 2011 at 07:42:10AM +0300, Avi Kivity wrote:
>> >  On 08/25/2011 05:04 AM, Marcelo Tosatti wrote:
>> > >>
>> > >>   It could increase the flood count independently of the 
>> accessed bit of
>> > >>   the spte being updated, zapping after 3 attempts as it is now.
>> > >>
>> > >>   But additionally reset the flood count if the gpte appears to 
>> be valid
>> > >>   (points to an existant gfn if the present bit is set, or if 
>> its zeroed).
>> > >
>> > >Well not zero, as thats a common pattern for non ptes.
>> > >
>> >
>> >  On 32-bit with 4GB RAM, practically anything is a valid gpte.
>>
>> The following could be required to consider a valid gpte, for write
>> flood detection purposes:
>>
>> - Must be present.
>> - PageCacheDisable must be unset.
>> - PageWriteThrough must be unset.
>>
>
> Unless the guest is using PAT.
>

And not swapping.
diff mbox

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 927ba73..9d17238 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@  struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
+	int write_flooding_count;
+
 	struct rcu_head rcu;
 };
 
@@ -353,10 +355,6 @@  struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-	gfn_t last_pt_write_gfn;
-	int   last_pt_write_count;
-	u64  *last_pte_updated;
-
 	struct fpu guest_fpu;
 	u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index adaa160..3230f84 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1695,6 +1695,7 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		} else if (sp->unsync)
 			kvm_mmu_mark_parents_unsync(sp);
 
+		sp->write_flooding_count = 0;
 		trace_kvm_mmu_get_page(sp, false);
 		return sp;
 	}
@@ -1847,15 +1848,6 @@  static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 	mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-	int i;
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 *parent_pte;
@@ -1915,7 +1907,6 @@  static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	}
 
 	sp->role.invalid = 1;
-	kvm_mmu_reset_last_pte_updated(kvm);
 	return ret;
 }
 
@@ -2360,8 +2351,6 @@  static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 	kvm_release_pfn_clean(pfn);
-	if (speculative)
-		vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3522,13 +3511,6 @@  static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
 		kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-	u64 *spte = vcpu->arch.last_pte_updated;
-
-	return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
 				    const u8 *new, int *bytes)
 {
@@ -3569,22 +3551,14 @@  static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-	bool flooded = false;
-
-	if (gfn == vcpu->arch.last_pt_write_gfn
-	    && !last_updated_pte_accessed(vcpu)) {
-		++vcpu->arch.last_pt_write_count;
-		if (vcpu->arch.last_pt_write_count >= 3)
-			flooded = true;
-	} else {
-		vcpu->arch.last_pt_write_gfn = gfn;
-		vcpu->arch.last_pt_write_count = 1;
-		vcpu->arch.last_pte_updated = NULL;
-	}
+	if (spte && !(*spte & shadow_accessed_mask))
+		sp->write_flooding_count++;
+	else
+		sp->write_flooding_count = 0;
 
-	return flooded;
+	return sp->write_flooding_count >= 3;
 }
 
 /*
@@ -3656,7 +3630,7 @@  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	LIST_HEAD(invalid_list);
 	u64 entry, gentry, *spte;
 	int npte;
-	bool remote_flush, local_flush, zap_page, flooded, misaligned;
+	bool remote_flush, local_flush, zap_page;
 
 	/*
 	 * If we don't have indirect shadow pages, it means no page is
@@ -3675,12 +3649,12 @@  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
-	flooded = detect_write_flooding(vcpu, gfn);
 	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
-		misaligned = detect_write_misaligned(sp, gpa, bytes);
+		spte = get_written_sptes(sp, gpa, &npte);
 
-		if (misaligned || flooded) {
+		if (detect_write_misaligned(sp, gpa, bytes) ||
+		      detect_write_flooding(sp, spte)) {
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 						     &invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index bdc2241..ec5c1b4 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -599,11 +599,9 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	 */
 	if (!r) {
 		pgprintk("%s: guest page fault\n", __func__);
-		if (!prefault) {
+		if (!prefault)
 			inject_page_fault(vcpu, &walker.fault);
-			/* reset fork detector */
-			vcpu->arch.last_pt_write_count = 0;
-		}
+
 		return 0;
 	}
 
@@ -641,9 +639,6 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	pgprintk("%s: shadow pte %p %llx emulate %d\n", __func__,
 		 sptep, *sptep, emulate);
 
-	if (!emulate)
-		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
-
 	++vcpu->stat.pf_fixed;
 	trace_kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 	spin_unlock(&vcpu->kvm->mmu_lock);