
[v7,10/11] KVM: MMU: collapse TLB flushes when zap all pages

Message ID 1369252560-11611-11-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong May 22, 2013, 7:55 p.m. UTC
kvm_zap_obsolete_pages uses the lock-break technique to zap pages,
and it currently flushes the tlb on every lock-break.

Instead, we can reload the mmu on all vcpus after updating the
generation number, so that the obsolete pages are not used on any
vcpu; after that, we no longer need to flush the tlb as obsolete
pages are zapped.

Note: kvm_mmu_commit_zap_page is still needed before freeing the
pages, since other vcpus may be doing lockless shadow page walks.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   32 ++++++++++++++++++++++----------
 1 files changed, 22 insertions(+), 10 deletions(-)

Comments

Gleb Natapov May 23, 2013, 6:12 a.m. UTC | #1
On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> it will flush tlb every time when it does lock-break
> 
> We can reload mmu on all vcpus after updating the generation
> number so that the obsolete pages are not used on any vcpus,
> after that we do not need to flush tlb when obsolete pages
> are zapped
> 
> Note: kvm_mmu_commit_zap_page is still needed before free
> the pages since other vcpus may be doing locklessly shadow
> page walking
> 
Since obsolete pages are not accessible to lockless page walking after
all roots are reloaded, I do not understand why an additional tlb flush
is needed. Also, why should a tlb flush prevent lockless walking from
using the page? Making the page unreachable from root_hpa does that, no?
 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c |   32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e676356..5e34056 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
>  restart:
>  	list_for_each_entry_safe_reverse(sp, node,
>  	      &kvm->arch.active_mmu_pages, link) {
> -		int ret;
> -
>  		/*
>  		 * No obsolete page exists before new created page since
>  		 * active_mmu_pages is the FIFO list.
> @@ -4254,21 +4252,24 @@ restart:
>  		if (sp->role.invalid)
>  			continue;
>  
> +		/*
> +		 * Need not flush tlb since we only zap the sp with invalid
> +		 * generation number.
> +		 */
>  		if (batch >= BATCH_ZAP_PAGES &&
> -		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
> +		      cond_resched_lock(&kvm->mmu_lock)) {
>  			batch = 0;
> -			kvm_mmu_commit_zap_page(kvm, &invalid_list);
> -			cond_resched_lock(&kvm->mmu_lock);
>  			goto restart;
>  		}
>  
> -		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> -		batch += ret;
> -
> -		if (ret)
> -			goto restart;
> +		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
> +						      &invalid_list);
>  	}
>  
> +	/*
> +	 * Should flush tlb before free page tables since lockless-walking
> +	 * may use the pages.
> +	 */
>  	kvm_mmu_commit_zap_page(kvm, &invalid_list);
>  }
>  
> @@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
>  	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
>  	kvm->arch.mmu_valid_gen++;
>  
> +	/*
> +	 * Notify all vcpus to reload its shadow page table
> +	 * and flush TLB. Then all vcpus will switch to new
> +	 * shadow page table with the new mmu_valid_gen.
> +	 *
> +	 * Note: we should do this under the protection of
> +	 * mmu-lock, otherwise, vcpu would purge shadow page
> +	 * but miss tlb flush.
> +	 */
> +	kvm_reload_remote_mmus(kvm);
> +
>  	kvm_zap_obsolete_pages(kvm);
>  	spin_unlock(&kvm->mmu_lock);
>  }
> -- 
> 1.7.7.6

--
			Gleb.
Xiao Guangrong May 23, 2013, 6:26 a.m. UTC | #2
On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>> it will flush tlb every time when it does lock-break
>>
>> We can reload mmu on all vcpus after updating the generation
>> number so that the obsolete pages are not used on any vcpus,
>> after that we do not need to flush tlb when obsolete pages
>> are zapped
>>
>> Note: kvm_mmu_commit_zap_page is still needed before free
>> the pages since other vcpus may be doing locklessly shadow
>> page walking
>>
> Since obsolete pages are not accessible for lockless page walking after
> reload of all roots I do not understand why additional tlb flush is

kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if a
vcpu is not running in guest mode, it does nothing except set the
request bit. So a vcpu can still start a lockless page walk after
kvm_reload_remote_mmus() has returned on another vcpu.

Like this scenario:

VCPU 0                              VCPU 1
                                 exit when it encounters #PF

kvm_reload_remote_mmus(){
    set vcpu1->request bit;

    do not send IPI due to
    vcpu 1 not running on guest mode

                                 call page-fault handler then go lockless walking !!!
    return
}


> needed. Also why tlb flush should prevent lockless-walking from using
> the page? Making page unreachable from root_hpa does that, no?

Lockless walking disables interrupts and sets the vcpu state to
READING_SHADOW_PAGE_TABLES. That state is treated like guest mode, so
kvm_flush_remote_tlbs() does send an IPI to such a vcpu.
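
For context, the lockless-walk guards in arch/x86/kvm/mmu.c of this era
look roughly like the sketch below (condensed; exact details vary by
kernel version). The point is that the mode write is published before
any spte is read:

	static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
	{
		/*
		 * Disabling interrupts and publishing
		 * READING_SHADOW_PAGE_TABLES makes this vcpu look "in
		 * guest mode" to make_all_cpus_request(), so
		 * kvm_flush_remote_tlbs() will IPI it and wait for the
		 * walk to finish.
		 */
		local_irq_disable();
		vcpu->mode = READING_SHADOW_PAGE_TABLES;

		/* Order the mode write before the spte reads of the walk. */
		smp_mb();
	}

	static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
	{
		/* Order the spte reads before leaving "guest mode". */
		smp_mb();
		vcpu->mode = OUTSIDE_GUEST_MODE;
		local_irq_enable();
	}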

Gleb Natapov May 23, 2013, 7:24 a.m. UTC | #3
On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> >> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> >> it will flush tlb every time when it does lock-break
> >>
> >> We can reload mmu on all vcpus after updating the generation
> >> number so that the obsolete pages are not used on any vcpus,
> >> after that we do not need to flush tlb when obsolete pages
> >> are zapped
> >>
> >> Note: kvm_mmu_commit_zap_page is still needed before free
> >> the pages since other vcpus may be doing locklessly shadow
> >> page walking
> >>
> > Since obsolete pages are not accessible for lockless page walking after
> > reload of all roots I do not understand why additional tlb flush is
> 
> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
> vcpu is not running on guest mode, it does nothing except set the request
> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
> return on other vcpu.
> 
> Like this scenario:
> 
> VCPU 0                              VCPU 1
>                                  exit when it encounters #PF
> 
> kvm_reload_remote_mmus(){
>     set vcpu1->request bit;
> 
>     do not send IPI due to
>     vcpu 1 not running on guest mode
> 
>                                  call page-fault handler then go lockless walking !!!
>     return
> }
> 
> 
> > needed. Also why tlb flush should prevent lockless-walking from using
> > the page? Making page unreachable from root_hpa does that, no?
> 
> lockless-walking disables the interrupt and makes the vcpu state as
> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.

kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
kvm_reload_remote_mmus() does, so why the same scenario you describe
above cannot happen with kvm_flush_remote_tlbs()?

--
			Gleb.
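
For reference, both kvm_flush_remote_tlbs() and kvm_reload_remote_mmus()
funnel into make_all_cpus_request() in virt/kvm/kvm_main.c, which only
IPIs vcpus that are not seen in OUTSIDE_GUEST_MODE. A condensed sketch
of the era's code (the allocation-failure fallback path is omitted):

	static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
	{
		int i, cpu, me = get_cpu();
		struct kvm_vcpu *vcpu;
		cpumask_var_t cpus;
		bool called = true;

		zalloc_cpumask_var(&cpus, GFP_ATOMIC);

		kvm_for_each_vcpu(i, vcpu, kvm) {
			kvm_make_request(req, vcpu);
			cpu = vcpu->cpu;

			/* Set ->requests bit before we read ->mode. */
			smp_mb();

			/*
			 * A vcpu in OUTSIDE_GUEST_MODE only gets its request
			 * bit set, with no IPI -- the window in Xiao's
			 * scenario above.  READING_SHADOW_PAGE_TABLES is not
			 * OUTSIDE_GUEST_MODE, so it does get an IPI.
			 */
			if (cpus != NULL && cpu != -1 && cpu != me &&
			    kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
				cpumask_set_cpu(cpu, cpus);
		}
		if (!cpumask_empty(cpus))
			smp_call_function_many(cpus, ack_flush, NULL, 1);
		else
			called = false;
		put_cpu();
		free_cpumask_var(cpus);
		return called;
	}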
Xiao Guangrong May 23, 2013, 7:37 a.m. UTC | #4
On 05/23/2013 03:24 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>>>> it will flush tlb every time when it does lock-break
>>>>
>>>> We can reload mmu on all vcpus after updating the generation
>>>> number so that the obsolete pages are not used on any vcpus,
>>>> after that we do not need to flush tlb when obsolete pages
>>>> are zapped
>>>>
>>>> Note: kvm_mmu_commit_zap_page is still needed before free
>>>> the pages since other vcpus may be doing locklessly shadow
>>>> page walking
>>>>
>>> Since obsolete pages are not accessible for lockless page walking after
>>> reload of all roots I do not understand why additional tlb flush is
>>
>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
>> vcpu is not running on guest mode, it does nothing except set the request
>> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
>> return on other vcpu.
>>
>> Like this scenario:
>>
>> VCPU 0                              VCPU 1
>>                                  exit when it encounters #PF
>>
>> kvm_reload_remote_mmus(){
>>     set vcpu1->request bit;
>>
>>     do not send IPI due to
>>     vcpu 1 not running on guest mode
>>
>>                                  call page-fault handler then go lockless walking !!!
>>     return
>> }
>>
>>
>>> needed. Also why tlb flush should prevent lockless-walking from using
>>> the page? Making page unreachable from root_hpa does that, no?
>>
>> lockless-walking disables the interrupt and makes the vcpu state as
>> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
>> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.
> 
> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
> kvm_reload_remote_mmus() does, so why the same scenario you describe
> above cannot happen with kvm_flush_remote_tlbs()?


After calling kvm_flush_remote_tlbs(), the page still exists in
vcpu->root, so we cannot guarantee that the page is not being used by
another vcpu.

But by the time kvm_mmu_commit_zap_page() is called, the page has
already been deleted from the vcpu's page tables, so after calling
kvm_flush_remote_tlbs() we can ensure that other vcpus cannot find
these pages.
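
The flush-before-free ordering being described here is visible in
kvm_mmu_commit_zap_page() itself; roughly (condensed from mmu.c of this
era):

	static void kvm_mmu_commit_zap_page(struct kvm *kvm,
					    struct list_head *invalid_list)
	{
		struct kvm_mmu_page *sp, *nsp;

		if (list_empty(invalid_list))
			return;

		/* Make sure everyone sees our page-table modifications. */
		smp_wmb();

		/*
		 * The pages are already unlinked; the flush waits for every
		 * vcpu in guest mode or in a lockless walk, so nobody can
		 * still be dereferencing them when they are freed below.
		 */
		kvm_flush_remote_tlbs(kvm);

		list_for_each_entry_safe(sp, nsp, invalid_list, link) {
			WARN_ON(!sp->role.invalid || sp->root_count);
			kvm_mmu_free_page(sp);
		}
	}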


Xiao Guangrong May 23, 2013, 7:38 a.m. UTC | #5
On 05/23/2013 03:37 PM, Xiao Guangrong wrote:
> On 05/23/2013 03:24 PM, Gleb Natapov wrote:
>> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
>>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
>>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>>>>> it will flush tlb every time when it does lock-break
>>>>>
>>>>> We can reload mmu on all vcpus after updating the generation
>>>>> number so that the obsolete pages are not used on any vcpus,
>>>>> after that we do not need to flush tlb when obsolete pages
>>>>> are zapped
>>>>>
>>>>> Note: kvm_mmu_commit_zap_page is still needed before free
>>>>> the pages since other vcpus may be doing locklessly shadow
>>>>> page walking
>>>>>
>>>> Since obsolete pages are not accessible for lockless page walking after
>>>> reload of all roots I do not understand why additional tlb flush is
>>>
>>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
>>> vcpu is not running on guest mode, it does nothing except set the request
>>> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
>>> return on other vcpu.
>>>
>>> Like this scenario:
>>>
>>> VCPU 0                              VCPU 1
>>>                                  exit when it encounters #PF
>>>
>>> kvm_reload_remote_mmus(){
>>>     set vcpu1->request bit;
>>>
>>>     do not send IPI due to
>>>     vcpu 1 not running on guest mode
>>>
>>>                                  call page-fault handler then go lockless walking !!!
>>>     return
>>> }
>>>
>>>
>>>> needed. Also why tlb flush should prevent lockless-walking from using
>>>> the page? Making page unreachable from root_hpa does that, no?
>>>
>>> lockless-walking disables the interrupt and makes the vcpu state as
>>> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
>>> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.
>>
>> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
>> kvm_reload_remote_mmus() does, so why the same scenario you describe
>> above cannot happen with kvm_flush_remote_tlbs()?
> 
> 
> After call kvm_flush_remote_tlbs(), the page still exists on vcpu->root,

Sorry, should be kvm_reload_remote_mmus() here.

> so we can not protect the page is being used by other vcpu.
> 
> But before call kvm_mmu_commit_zap_page(), the page has been deleted from
> vcpu's page table, after call kvm_flush_remote_tlbs(), we can ensure that
> other vcpus can not find these pages.
> 
> 

Gleb Natapov May 23, 2013, 7:56 a.m. UTC | #6
On Thu, May 23, 2013 at 03:38:49PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 03:37 PM, Xiao Guangrong wrote:
> > On 05/23/2013 03:24 PM, Gleb Natapov wrote:
> >> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
> >>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> >>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> >>>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> >>>>> it will flush tlb every time when it does lock-break
> >>>>>
> >>>>> We can reload mmu on all vcpus after updating the generation
> >>>>> number so that the obsolete pages are not used on any vcpus,
> >>>>> after that we do not need to flush tlb when obsolete pages
> >>>>> are zapped
> >>>>>
> >>>>> Note: kvm_mmu_commit_zap_page is still needed before free
> >>>>> the pages since other vcpus may be doing locklessly shadow
> >>>>> page walking
> >>>>>
> >>>> Since obsolete pages are not accessible for lockless page walking after
> >>>> reload of all roots I do not understand why additional tlb flush is
> >>>
> >>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if the
> >>> vcpu is not running on guest mode, it does nothing except set the request
> >>> bit. So, the vcpu can go lockless page walking after kvm_reload_remote_mmus()
> >>> return on other vcpu.
> >>>
> >>> Like this scenario:
> >>>
> >>> VCPU 0                              VCPU 1
> >>>                                  exit when it encounters #PF
> >>>
> >>> kvm_reload_remote_mmus(){
> >>>     set vcpu1->request bit;
> >>>
> >>>     do not send IPI due to
> >>>     vcpu 1 not running on guest mode
> >>>
> >>>                                  call page-fault handler then go lockless walking !!!
> >>>     return
> >>> }
> >>>
> >>>
> >>>> needed. Also why tlb flush should prevent lockless-walking from using
> >>>> the page? Making page unreachable from root_hpa does that, no?
> >>>
> >>> lockless-walking disables the interrupt and makes the vcpu state as
> >>> READING_SHADOW_PAGE_TABLES, this state is treated as GUEST_MODE,
> >>> kvm_flush_remote_tlbs() should send IPI to this vcpu in this case.
> >>
> >> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
> >> kvm_reload_remote_mmus() does, so why the same scenario you describe
> >> above cannot happen with kvm_flush_remote_tlbs()?
> > 
> > 
> > After call kvm_flush_remote_tlbs(), the page still exists on vcpu->root,
> 
> Sorry, should be kvm_reload_remote_mmus() here.
> 
> > so we can not protect the page is being used by other vcpu.
> > 
> > But before call kvm_mmu_commit_zap_page(), the page has been deleted from
> > vcpu's page table, after call kvm_flush_remote_tlbs(), we can ensure that
> > other vcpus can not find these pages.
> > 
Ah, I see, so the barrier is needed after the page is unlinked from the
vcpu->root hierarchy.

--
			Gleb.
Marcelo Tosatti May 28, 2013, 12:36 a.m. UTC | #7
On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
> it will flush tlb every time when it does lock-break
> 
> We can reload mmu on all vcpus after updating the generation
> number so that the obsolete pages are not used on any vcpus,
> after that we do not need to flush tlb when obsolete pages
> are zapped

After that point batching is also not relevant anymore?


Still concerned about a similar case mentioned earlier:

"
Note the account for pages freed step after pages are actually
freed: as discussed with Takuya, having pages freed and freed page
accounting out of sync across mmu_lock is potentially problematic:
kvm->arch.n_used_mmu_pages and friends do not reflect reality which can
cause problems for SLAB freeing and page allocation throttling.
"

This is a real problem: if you decrease n_used_mmu_pages at
kvm_mmu_prepare_zap_page but only actually free the pages later
at kvm_mmu_commit_zap_page, there is the possibility of retaining
a huge number of pages. There should be a maximum number of pages
on invalid_list.

(even higher possibility if you schedule without freeing pages reported
as released!).

> Note: kvm_mmu_commit_zap_page is still needed before free
> the pages since other vcpus may be doing locklessly shadow
> page walking
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c |   32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e676356..5e34056 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
>  restart:
>  	list_for_each_entry_safe_reverse(sp, node,
>  	      &kvm->arch.active_mmu_pages, link) {
> -		int ret;
> -
>  		/*
>  		 * No obsolete page exists before new created page since
>  		 * active_mmu_pages is the FIFO list.
> @@ -4254,21 +4252,24 @@ restart:
>  		if (sp->role.invalid)
>  			continue;
>  
> +		/*
> +		 * Need not flush tlb since we only zap the sp with invalid
> +		 * generation number.
> +		 */
>  		if (batch >= BATCH_ZAP_PAGES &&
> -		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
> +		      cond_resched_lock(&kvm->mmu_lock)) {
>  			batch = 0;
> -			kvm_mmu_commit_zap_page(kvm, &invalid_list);
> -			cond_resched_lock(&kvm->mmu_lock);
>  			goto restart;
>  		}
>  
> -		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> -		batch += ret;
> -
> -		if (ret)
> -			goto restart;
> +		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
> +						      &invalid_list);
>  	}
>  
> +	/*
> +	 * Should flush tlb before free page tables since lockless-walking
> +	 * may use the pages.
> +	 */
>  	kvm_mmu_commit_zap_page(kvm, &invalid_list);
>  }
>  
> @@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
>  	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
>  	kvm->arch.mmu_valid_gen++;
>  
> +	/*
> +	 * Notify all vcpus to reload its shadow page table
> +	 * and flush TLB. Then all vcpus will switch to new
> +	 * shadow page table with the new mmu_valid_gen.
> +	 *
> +	 * Note: we should do this under the protection of
> +	 * mmu-lock, otherwise, vcpu would purge shadow page
> +	 * but miss tlb flush.
> +	 */
> +	kvm_reload_remote_mmus(kvm);
> +
>  	kvm_zap_obsolete_pages(kvm);
>  	spin_unlock(&kvm->mmu_lock);
>  }
> -- 
> 1.7.7.6
Xiao Guangrong May 28, 2013, 3:19 p.m. UTC | #8
On 05/28/2013 08:36 AM, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>> it will flush tlb every time when it does lock-break
>>
>> We can reload mmu on all vcpus after updating the generation
>> number so that the obsolete pages are not used on any vcpus,
>> after that we do not need to flush tlb when obsolete pages
>> are zapped
> 
> After that point batching is also not relevant anymore?

No... without batching, we do not know how much time we will
spend zapping pages. That is bad for the case where
zap_all_pages is called in vcpu context.

> 
> 
> Still concerned about a similar case mentioned earlier:
> 
> "
> Note the account for pages freed step after pages are actually
> freed: as discussed with Takuya, having pages freed and freed page
> accounting out of sync across mmu_lock is potentially problematic:
> kvm->arch.n_used_mmu_pages and friends do not reflect reality which can
> cause problems for SLAB freeing and page allocation throttling.
> "
> 
> This is a real problem, if you decrease n_used_mmu_pages at
> kvm_mmu_prepare_zap_page, but only actually free pages later 
> at kvm_mmu_commit_zap_page, there is the possibility of allowing
> a huge number to be retained. There should be a maximum number of pages
> at invalid_list.
> 
> (even higher possibility if you schedule without freeing pages reported
> as released!).
> 
>> Note: kvm_mmu_commit_zap_page is still needed before free
>> the pages since other vcpus may be doing locklessly shadow
>> page walking

Ah, yes, I agree with you.

We can introduce a list, say kvm->arch.obsolete_pages, to link all of
the zapped pages; the page shrinker will free the pages on that list first.

Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you please
let them be merged first, and add the comments and tlb optimization later?
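
A hypothetical sketch of that idea (the field name obsolete_pages and
the helper below are illustrative, not part of the posted series): the
shrinker drains already-zapped pages first, so page accounting and
actually-freed pages cannot drift apart unboundedly.

	/* Assumed new per-VM field:
	 *	struct list_head obsolete_pages;   in struct kvm_arch
	 */
	static int mmu_shrink_obsolete_pages(struct kvm *kvm)
	{
		int freed = 0;
		LIST_HEAD(invalid_list);

		spin_lock(&kvm->mmu_lock);
		if (!list_empty(&kvm->arch.obsolete_pages)) {
			/* Already zapped: cheapest possible reclaim. */
			list_splice_init(&kvm->arch.obsolete_pages,
					 &invalid_list);
			/* Flushes TLBs, then frees every page on the list. */
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			freed = 1;
		}
		spin_unlock(&kvm->mmu_lock);

		return freed;
	}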


Xiao Guangrong May 29, 2013, 3:03 a.m. UTC | #9
On 05/28/2013 11:19 PM, Xiao Guangrong wrote:
> On 05/28/2013 08:36 AM, Marcelo Tosatti wrote:
>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>> kvm_zap_obsolete_pages uses lock-break technique to zap pages,
>>> it will flush tlb every time when it does lock-break
>>>
>>> We can reload mmu on all vcpus after updating the generation
>>> number so that the obsolete pages are not used on any vcpus,
>>> after that we do not need to flush tlb when obsolete pages
>>> are zapped
>>
>> After that point batching is also not relevant anymore?
> 
> no... without batching, we do not know how much time we will
> spend to zap pages. It is not good for the case that
> zap_all_pages is called in the vcpu context.
> 
>>
>>
>> Still concerned about a similar case mentioned earlier:
>>
>> "
>> Note the account for pages freed step after pages are actually
>> freed: as discussed with Takuya, having pages freed and freed page
>> accounting out of sync across mmu_lock is potentially problematic:
>> kvm->arch.n_used_mmu_pages and friends do not reflect reality which can
>> cause problems for SLAB freeing and page allocation throttling.
>> "
>>
>> This is a real problem, if you decrease n_used_mmu_pages at
>> kvm_mmu_prepare_zap_page, but only actually free pages later 
>> at kvm_mmu_commit_zap_page, there is the possibility of allowing
>> a huge number to be retained. There should be a maximum number of pages
>> at invalid_list.
>>
>> (even higher possibility if you schedule without freeing pages reported
>> as released!).
>>
>>> Note: kvm_mmu_commit_zap_page is still needed before free
>>> the pages since other vcpus may be doing locklessly shadow
>>> page walking
> 
> Ah, yes, i agree with you.
> 
> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> zapped-page, the page-shrink will free the page on that list first.
> 
> Marcelo, if you do not have objection on  patch 1 ~ 8 and 11, could you please
> let them merged first, and do add some comments and tlb optimization later?

Exclude patch 11 please, since it depends on the "collapse" optimization.


Marcelo Tosatti May 29, 2013, 12:39 p.m. UTC | #10
On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> >>> the pages since other vcpus may be doing locklessly shadow
> >>> page walking
> > 
> > Ah, yes, i agree with you.
> > 
> > We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> > zapped-page, the page-shrink will free the page on that list first.
> > 
> > Marcelo, if you do not have objection on  patch 1 ~ 8 and 11, could you please
> > let them merged first, and do add some comments and tlb optimization later?
> 
> Exclude patch 11 please, since it depends on the "collapse" optimization.

I'm fine with patch 1 being merged. I think the remaining patches need better
understanding or explanation. The problems i see are:

1) The magic number "10" to zap before considering reschedule is
annoying. It would be good to understand why it is needed at all.

But then again, the testcase is measuring kvm_mmu_zap_all performance
alone which we know is not a common operation, so perhaps there is
no need for that minimum-pages-to-zap-before-reschedule.

2) The problem above (retention of a large number of pages while
zapping) can be fatal; it can lead to OOM and a host crash.

3) Make sure that the introduction of obsolete pages cannot lead to a
huge number of shadow pages hanging around (the correct reason you gave
for not merging https://patchwork.kernel.org/patch/2309641/ is no
longer true with obsolete pages).

Other than these points, i'm fine with obsolete pages optimization
to speed up kvm_mmu_zap_all and the rest of the patchset.

Xiao Guangrong May 29, 2013, 1:19 p.m. UTC | #11
On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
>>>>> the pages since other vcpus may be doing locklessly shadow
>>>>> page walking
>>>
>>> Ah, yes, i agree with you.
>>>
>>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
>>> zapped-page, the page-shrink will free the page on that list first.
>>>
>>> Marcelo, if you do not have objection on  patch 1 ~ 8 and 11, could you please
>>> let them merged first, and do add some comments and tlb optimization later?
>>
>> Exclude patch 11 please, since it depends on the "collapse" optimization.
> 
> I'm fine with patch 1 being merged. I think the remaining patches need better
> understanding or explanation. The problems i see are:
> 
> 1) The magic number "10" to zap before considering reschedule is
> annoying. It would be good to understand why it is needed at all.

......

> 
> But then again, the testcase is measuring kvm_mmu_zap_all performance
> alone which we know is not a common operation, so perhaps there is
> no need for that minimum-pages-to-zap-before-reschedule.

Well, although this is not a common operation, it can be triggered by
a VCPU - if one VCPU takes a long time in zap-all-pages, other vcpus
miss IPI syncs or miss IO. This easily causes soft lockups if the vcpu
is doing memslot-related things.

> 
> 2) The problem above (retention of large number of pages while zapping)
> can be fatal, it can lead to OOM and host crash.

This problem is introduced by this patch (patch 10); without this patch,
the pages are always zapped before the mmu-lock is released.

> 
> 3) Make sure that introduction of obsolete pages can not lead to a 
> huge number of shadow pages around (the correct reason you gave for not merging
> https://patchwork.kernel.org/patch/2309641/ is not true anymore
> obsolete pages).

Actually, this question is the same as 2); it is also the page-reclaim
problem. Without patch #10, there is no problem.

> 
> Other than these points, i'm fine with obsolete pages optimization
> to speed up kvm_mmu_zap_all and the rest of the patchset.

Thank you!


Gleb Natapov May 30, 2013, 12:53 a.m. UTC | #12
On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> >>>>> the pages since other vcpus may be doing locklessly shadow
> >>>>> page walking
> >>>
> >>> Ah, yes, i agree with you.
> >>>
> >>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> >>> zapped-page, the page-shrink will free the page on that list first.
> >>>
> >>> Marcelo, if you do not have objection on  patch 1 ~ 8 and 11, could you please
> >>> let them merged first, and do add some comments and tlb optimization later?
> >>
> >> Exclude patch 11 please, since it depends on the "collapse" optimization.
> > 
> > I'm fine with patch 1 being merged. I think the remaining patches need better
> > understanding or explanation. The problems i see are:
> > 
> > 1) The magic number "10" to zap before considering reschedule is
> > annoying. It would be good to understand why it is needed at all.
> 
> ......
> 
> > 
> > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > alone which we know is not a common operation, so perhaps there is
> > no need for that minimum-pages-to-zap-before-reschedule.
> 
> Well. Although, this is not the common operation, but this operation
> can be triggered by VCPU - it one VCPU take long time on zap-all-pages,
> other vcpus is missing IPI-synce, or missing IO. This is easily cause
> soft lockups if the vcpu is doing memslot-releated things.
> 
+1. If it is triggerable by a guest it may slow down the guest, but we
should not allow it to slow down the host.

--
			Gleb.
Takuya Yoshikawa May 30, 2013, 4:24 p.m. UTC | #13
On Thu, 30 May 2013 03:53:38 +0300
Gleb Natapov <gleb@redhat.com> wrote:

> On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> > On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> > >>>>> the pages since other vcpus may be doing locklessly shadow
> > >>>>> page walking
> > >>>
> > >>> Ah, yes, i agree with you.
> > >>>
> > >>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> > >>> zapped-page, the page-shrink will free the page on that list first.
> > >>>
> > >>> Marcelo, if you do not have objection on  patch 1 ~ 8 and 11, could you please
> > >>> let them merged first, and do add some comments and tlb optimization later?
> > >>
> > >> Exclude patch 11 please, since it depends on the "collapse" optimization.
> > > 
> > > I'm fine with patch 1 being merged. I think the remaining patches need better
> > > understanding or explanation. The problems i see are:
> > > 
> > > 1) The magic number "10" to zap before considering reschedule is
> > > annoying. It would be good to understand why it is needed at all.
> > 
> > ......
> > 
> > > 
> > > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > > alone which we know is not a common operation, so perhaps there is
> > > no need for that minimum-pages-to-zap-before-reschedule.
> > 
> > Well. Although, this is not the common operation, but this operation
> > can be triggered by VCPU - it one VCPU take long time on zap-all-pages,
> > other vcpus is missing IPI-synce, or missing IO. This is easily cause
> > soft lockups if the vcpu is doing memslot-releated things.
> > 
> +1. If it is trigarable by a guest it may slow down the guest, but we
> should not allow for it to slow down a host.
> 

Well, I don't object to the minimum-pages-to-zap-before-reschedule idea
itself, but if you're going to take patch 4, please at least add a warning
in the changelog that the magic number "10" was selected without good enough
reasoning.

"[ It improves kernel building 0.6% ~ 1% ]" alone will make it hard for
others to change the number later.

I actually once tried to do a similar thing for other code.  So I have a
possible reasoning for this, and 10 should probably be changed later.

	Takuya
Takuya Yoshikawa May 30, 2013, 5:10 p.m. UTC | #14
On Fri, 31 May 2013 01:24:43 +0900
Takuya Yoshikawa <takuya.yoshikawa@gmail.com> wrote:

> On Thu, 30 May 2013 03:53:38 +0300
> Gleb Natapov <gleb@redhat.com> wrote:
> 
> > On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> > > On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > > > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> > > >>>>> the pages since other vcpus may be doing locklessly shadow
> > > >>>>> page walking
> > > >>>
> > > >>> Ah, yes, i agree with you.
> > > >>>
> > > >>> We can introduce a list, say kvm->arch.obsolte_pages, to link all of the
> > > >>> zapped-page, the page-shrink will free the page on that list first.
> > > >>>
> > > >>> Marcelo, if you do not have objection on  patch 1 ~ 8 and 11, could you please
> > > >>> let them merged first, and do add some comments and tlb optimization later?
> > > >>
> > > >> Exclude patch 11 please, since it depends on the "collapse" optimization.
> > > > 
> > > > I'm fine with patch 1 being merged. I think the remaining patches need better
> > > > understanding or explanation. The problems i see are:
> > > > 
> > > > 1) The magic number "10" to zap before considering reschedule is
> > > > annoying. It would be good to understand why it is needed at all.
> > > 
> > > ......
> > > 
> > > > 
> > > > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > > > alone which we know is not a common operation, so perhaps there is
> > > > no need for that minimum-pages-to-zap-before-reschedule.
> > > 
> > > Well. Although, this is not the common operation, but this operation
> > > can be triggered by VCPU - it one VCPU take long time on zap-all-pages,
> > > other vcpus is missing IPI-synce, or missing IO. This is easily cause
> > > soft lockups if the vcpu is doing memslot-releated things.
> > > 
> > +1. If it is trigarable by a guest it may slow down the guest, but we
> > should not allow for it to slow down a host.
> > 
> 
> Well, I don't object to the minimum-pages-to-zap-before-reschedule idea
> itself, but if you're going to take patch 4, please at least add a warning
> in the changelog that the magic number "10" was selected without good enough
> reasoning.
> 
> "[ It improves kernel building 0.6% ~ 1% ]" alone will make it hard for
> others to change the number later.
> 
> I actually once tried to do a similar thing for other code.  So I have a
> possible reasoning for this, and 10 should probably be changed later.
> 

In this case, the solution seems to be very simple: just drop spin_needbreak()
and leave need_resched() alone.

This way we can guarantee that zap-all gets a fair amount of CPU time in
each time slice, from the host scheduler's point of view.  Of course this
can block other VCPU threads waiting for mmu_lock during that time slice,
but that should be much better than blocking them for some magical number
of zappings.

We also need to remember that spin_needbreak() does not do anything for some
preempt config settings.

	Takuya
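
Sketched against the loop in the patch below, Takuya's suggestion would
look roughly like this (hypothetical; note that committing before the
reschedule also avoids holding freed-accounted pages on invalid_list
across a sleep, which is Marcelo's retention concern):

		if (need_resched()) {
			/*
			 * Yield only when this thread's time slice is really
			 * up, instead of also breaking on mmu_lock contention
			 * after a magic number of zaps.  Commit (flush + free)
			 * first so no zapped pages are retained while asleep.
			 */
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
			goto restart;
		}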

Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e676356..5e34056 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4237,8 +4237,6 @@  static void kvm_zap_obsolete_pages(struct kvm *kvm)
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
-		int ret;
-
 		/*
 		 * No obsolete page exists before new created page since
 		 * active_mmu_pages is the FIFO list.
@@ -4254,21 +4252,24 @@  restart:
 		if (sp->role.invalid)
 			continue;
 
+		/*
+		 * Need not flush tlb since we only zap the sp with invalid
+		 * generation number.
+		 */
 		if (batch >= BATCH_ZAP_PAGES &&
-		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+		      cond_resched_lock(&kvm->mmu_lock)) {
 			batch = 0;
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
-			cond_resched_lock(&kvm->mmu_lock);
 			goto restart;
 		}
 
-		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
-		batch += ret;
-
-		if (ret)
-			goto restart;
+		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
+						      &invalid_list);
 	}
 
+	/*
+	 * Should flush tlb before free page tables since lockless-walking
+	 * may use the pages.
+	 */
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 }
 
@@ -4287,6 +4288,17 @@  void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
 	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
 	kvm->arch.mmu_valid_gen++;
 
+	/*
+	 * Notify all vcpus to reload its shadow page table
+	 * and flush TLB. Then all vcpus will switch to new
+	 * shadow page table with the new mmu_valid_gen.
+	 *
+	 * Note: we should do this under the protection of
+	 * mmu-lock, otherwise, vcpu would purge shadow page
+	 * but miss tlb flush.
+	 */
+	kvm_reload_remote_mmus(kvm);
+
 	kvm_zap_obsolete_pages(kvm);
 	spin_unlock(&kvm->mmu_lock);
 }