Message ID | 1369252560-11611-11-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
---|---
State | New, archived
On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
> it flushes the TLB every time it breaks the lock.
>
> We can instead reload the mmu on all vcpus after updating the
> generation number, so that the obsolete pages are not used on any
> vcpu; after that, we no longer need to flush the TLB when obsolete
> pages are zapped.
>
> Note: kvm_mmu_commit_zap_page is still needed before freeing the
> pages, since other vcpus may be walking shadow pages locklessly.
>
Since obsolete pages are not accessible to lockless page walking after
all roots are reloaded, I do not understand why the additional TLB
flush is needed. Also, why should a TLB flush prevent the lockless walk
from using the page? Making the page unreachable from root_hpa does
that, no?

> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e676356..5e34056 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
>  restart:
>  	list_for_each_entry_safe_reverse(sp, node,
>  	      &kvm->arch.active_mmu_pages, link) {
> -		int ret;
> -
>  		/*
>  		 * No obsolete page exists before new created page since
>  		 * active_mmu_pages is the FIFO list.
> @@ -4254,21 +4252,24 @@ restart:
>  		if (sp->role.invalid)
>  			continue;
>
> +		/*
> +		 * Need not flush tlb since we only zap the sp with invalid
> +		 * generation number.
> +		 */
>  		if (batch >= BATCH_ZAP_PAGES &&
> -		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
> +		      cond_resched_lock(&kvm->mmu_lock)) {
>  			batch = 0;
> -			kvm_mmu_commit_zap_page(kvm, &invalid_list);
> -			cond_resched_lock(&kvm->mmu_lock);
>  			goto restart;
>  		}
>
> -		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> -		batch += ret;
> -
> -		if (ret)
> -			goto restart;
> +		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
> +							   &invalid_list);
>  	}
>
> +	/*
> +	 * Should flush tlb before free page tables since lockless-walking
> +	 * may use the pages.
> +	 */
>  	kvm_mmu_commit_zap_page(kvm, &invalid_list);
>  }
>
> @@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
>  	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
>  	kvm->arch.mmu_valid_gen++;
>
> +	/*
> +	 * Notify all vcpus to reload its shadow page table
> +	 * and flush TLB. Then all vcpus will switch to new
> +	 * shadow page table with the new mmu_valid_gen.
> +	 *
> +	 * Note: we should do this under the protection of
> +	 * mmu-lock, otherwise, vcpu would purge shadow page
> +	 * but miss tlb flush.
> +	 */
> +	kvm_reload_remote_mmus(kvm);
> +
>  	kvm_zap_obsolete_pages(kvm);
>  	spin_unlock(&kvm->mmu_lock);
>  }
> --
> 1.7.7.6

--
Gleb.
On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
>> it flushes the TLB every time it breaks the lock.
>>
>> We can instead reload the mmu on all vcpus after updating the
>> generation number, so that the obsolete pages are not used on any
>> vcpu; after that, we no longer need to flush the TLB when obsolete
>> pages are zapped.
>>
>> Note: kvm_mmu_commit_zap_page is still needed before freeing the
>> pages, since other vcpus may be walking shadow pages locklessly.
>>
> Since obsolete pages are not accessible to lockless page walking after
> all roots are reloaded, I do not understand why the additional TLB

kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if a
vcpu is not running in guest mode, it does nothing except set the
request bit. So a vcpu can still start a lockless page walk after
kvm_reload_remote_mmus() has returned on another vcpu.

Like this scenario:

VCPU 0                                 VCPU 1
                                       exit when it encounters #PF

kvm_reload_remote_mmus() {
    set vcpu1->request bit;

    do not send IPI because vcpu 1
    is not running in guest mode
                                       call the page-fault handler,
                                       then do a lockless walk !!!
    return
}

> flush is needed. Also, why should a TLB flush prevent the lockless walk
> from using the page? Making the page unreachable from root_hpa does
> that, no?

The lockless walk disables interrupts and sets the vcpu state to
READING_SHADOW_PAGE_TABLES; this state is treated like guest mode, so
kvm_flush_remote_tlbs() sends an IPI to such a vcpu.
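For reference, the lockless-walk bracket Xiao refers to looked roughly
like this in the mmu code of that era (a paraphrased sketch of the
mechanism, not the verbatim upstream source):

/*
 * Sketch of the lockless shadow-page-walk bracket.  While the walk is
 * in progress the vcpu advertises READING_SHADOW_PAGE_TABLES; since
 * that is not OUTSIDE_GUEST_MODE, kvm_flush_remote_tlbs() sends an IPI
 * to the vcpu, and the IPI cannot be serviced until interrupts are
 * re-enabled, i.e. until the walk has finished.
 */
static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
{
	local_irq_disable();
	vcpu->mode = READING_SHADOW_PAGE_TABLES;
	/* Keep spte reads from being reordered before the mode write. */
	smp_mb();
}

static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
{
	/* Keep spte reads from being reordered after the mode write. */
	smp_mb();
	vcpu->mode = OUTSIDE_GUEST_MODE;
	local_irq_enable();
}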
On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> > On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> >> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
> >> it flushes the TLB every time it breaks the lock.
> >>
> >> We can instead reload the mmu on all vcpus after updating the
> >> generation number, so that the obsolete pages are not used on any
> >> vcpu; after that, we no longer need to flush the TLB when obsolete
> >> pages are zapped.
> >>
> >> Note: kvm_mmu_commit_zap_page is still needed before freeing the
> >> pages, since other vcpus may be walking shadow pages locklessly.
> >>
> > Since obsolete pages are not accessible to lockless page walking after
> > all roots are reloaded, I do not understand why the additional TLB
>
> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if a
> vcpu is not running in guest mode, it does nothing except set the
> request bit. So a vcpu can still start a lockless page walk after
> kvm_reload_remote_mmus() has returned on another vcpu.
>
> Like this scenario:
>
> VCPU 0                                 VCPU 1
>                                        exit when it encounters #PF
>
> kvm_reload_remote_mmus() {
>     set vcpu1->request bit;
>
>     do not send IPI because vcpu 1
>     is not running in guest mode
>                                        call the page-fault handler,
>                                        then do a lockless walk !!!
>     return
> }
>
> > flush is needed. Also, why should a TLB flush prevent the lockless walk
> > from using the page? Making the page unreachable from root_hpa does
> > that, no?
>
> The lockless walk disables interrupts and sets the vcpu state to
> READING_SHADOW_PAGE_TABLES; this state is treated like guest mode, so
> kvm_flush_remote_tlbs() sends an IPI to such a vcpu.

kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
kvm_reload_remote_mmus() does, so why can't the same scenario you
describe above happen with kvm_flush_remote_tlbs()?

--
Gleb.
On 05/23/2013 03:24 PM, Gleb Natapov wrote:
> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>>> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
>>>> it flushes the TLB every time it breaks the lock.
>>>>
>>>> We can instead reload the mmu on all vcpus after updating the
>>>> generation number, so that the obsolete pages are not used on any
>>>> vcpu; after that, we no longer need to flush the TLB when obsolete
>>>> pages are zapped.
>>>>
>>>> Note: kvm_mmu_commit_zap_page is still needed before freeing the
>>>> pages, since other vcpus may be walking shadow pages locklessly.
>>>>
>>> Since obsolete pages are not accessible to lockless page walking after
>>> all roots are reloaded, I do not understand why the additional TLB
>>
>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if a
>> vcpu is not running in guest mode, it does nothing except set the
>> request bit. So a vcpu can still start a lockless page walk after
>> kvm_reload_remote_mmus() has returned on another vcpu.
>>
>> Like this scenario:
>>
>> VCPU 0                                 VCPU 1
>>                                        exit when it encounters #PF
>>
>> kvm_reload_remote_mmus() {
>>     set vcpu1->request bit;
>>
>>     do not send IPI because vcpu 1
>>     is not running in guest mode
>>                                        call the page-fault handler,
>>                                        then do a lockless walk !!!
>>     return
>> }
>>
>>> flush is needed. Also, why should a TLB flush prevent the lockless walk
>>> from using the page? Making the page unreachable from root_hpa does
>>> that, no?
>>
>> The lockless walk disables interrupts and sets the vcpu state to
>> READING_SHADOW_PAGE_TABLES; this state is treated like guest mode, so
>> kvm_flush_remote_tlbs() sends an IPI to such a vcpu.
>
> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
> kvm_reload_remote_mmus() does, so why can't the same scenario you
> describe above happen with kvm_flush_remote_tlbs()?

After calling kvm_flush_remote_tlbs(), the page still exists under
vcpu->root, so we cannot guarantee that the page is not being used by
another vcpu.

But before kvm_mmu_commit_zap_page() is called, the page has already
been unlinked from the vcpu's page tables, so after
kvm_flush_remote_tlbs() we can be sure that other vcpus cannot find
these pages.
On 05/23/2013 03:37 PM, Xiao Guangrong wrote:
> On 05/23/2013 03:24 PM, Gleb Natapov wrote:
>> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
>>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
>>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>>>> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
>>>>> it flushes the TLB every time it breaks the lock.
>>>>>
>>>>> We can instead reload the mmu on all vcpus after updating the
>>>>> generation number, so that the obsolete pages are not used on any
>>>>> vcpu; after that, we no longer need to flush the TLB when obsolete
>>>>> pages are zapped.
>>>>>
>>>>> Note: kvm_mmu_commit_zap_page is still needed before freeing the
>>>>> pages, since other vcpus may be walking shadow pages locklessly.
>>>>>
>>>> Since obsolete pages are not accessible to lockless page walking after
>>>> all roots are reloaded, I do not understand why the additional TLB
>>>
>>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if a
>>> vcpu is not running in guest mode, it does nothing except set the
>>> request bit. So a vcpu can still start a lockless page walk after
>>> kvm_reload_remote_mmus() has returned on another vcpu.
>>>
>>> Like this scenario:
>>>
>>> VCPU 0                                 VCPU 1
>>>                                        exit when it encounters #PF
>>>
>>> kvm_reload_remote_mmus() {
>>>     set vcpu1->request bit;
>>>
>>>     do not send IPI because vcpu 1
>>>     is not running in guest mode
>>>                                        call the page-fault handler,
>>>                                        then do a lockless walk !!!
>>>     return
>>> }
>>>
>>>> flush is needed. Also, why should a TLB flush prevent the lockless walk
>>>> from using the page? Making the page unreachable from root_hpa does
>>>> that, no?
>>>
>>> The lockless walk disables interrupts and sets the vcpu state to
>>> READING_SHADOW_PAGE_TABLES; this state is treated like guest mode, so
>>> kvm_flush_remote_tlbs() sends an IPI to such a vcpu.
>>
>> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
>> kvm_reload_remote_mmus() does, so why can't the same scenario you
>> describe above happen with kvm_flush_remote_tlbs()?
>
> After calling kvm_flush_remote_tlbs(), the page still exists under
> vcpu->root, so we cannot guarantee that the page is not being used by
> another vcpu.

Sorry, that should be kvm_reload_remote_mmus() here.

> But before kvm_mmu_commit_zap_page() is called, the page has already
> been unlinked from the vcpu's page tables, so after
> kvm_flush_remote_tlbs() we can be sure that other vcpus cannot find
> these pages.
>
On Thu, May 23, 2013 at 03:38:49PM +0800, Xiao Guangrong wrote:
> On 05/23/2013 03:37 PM, Xiao Guangrong wrote:
> > On 05/23/2013 03:24 PM, Gleb Natapov wrote:
> >> On Thu, May 23, 2013 at 02:26:57PM +0800, Xiao Guangrong wrote:
> >>> On 05/23/2013 02:12 PM, Gleb Natapov wrote:
> >>>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> >>>>> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
> >>>>> it flushes the TLB every time it breaks the lock.
> >>>>>
> >>>>> We can instead reload the mmu on all vcpus after updating the
> >>>>> generation number, so that the obsolete pages are not used on any
> >>>>> vcpu; after that, we no longer need to flush the TLB when obsolete
> >>>>> pages are zapped.
> >>>>>
> >>>>> Note: kvm_mmu_commit_zap_page is still needed before freeing the
> >>>>> pages, since other vcpus may be walking shadow pages locklessly.
> >>>>>
> >>>> Since obsolete pages are not accessible to lockless page walking after
> >>>> all roots are reloaded, I do not understand why the additional TLB
> >>>
> >>> kvm_reload_remote_mmus() forces vcpus to leave guest mode, but if a
> >>> vcpu is not running in guest mode, it does nothing except set the
> >>> request bit. So a vcpu can still start a lockless page walk after
> >>> kvm_reload_remote_mmus() has returned on another vcpu.
> >>>
> >>> Like this scenario:
> >>>
> >>> VCPU 0                                 VCPU 1
> >>>                                        exit when it encounters #PF
> >>>
> >>> kvm_reload_remote_mmus() {
> >>>     set vcpu1->request bit;
> >>>
> >>>     do not send IPI because vcpu 1
> >>>     is not running in guest mode
> >>>                                        call the page-fault handler,
> >>>                                        then do a lockless walk !!!
> >>>     return
> >>> }
> >>>
> >>>> flush is needed. Also, why should a TLB flush prevent the lockless walk
> >>>> from using the page? Making the page unreachable from root_hpa does
> >>>> that, no?
> >>>
> >>> The lockless walk disables interrupts and sets the vcpu state to
> >>> READING_SHADOW_PAGE_TABLES; this state is treated like guest mode, so
> >>> kvm_flush_remote_tlbs() sends an IPI to such a vcpu.
> >>
> >> kvm_flush_remote_tlbs() uses the same make_all_cpus_request() as
> >> kvm_reload_remote_mmus() does, so why can't the same scenario you
> >> describe above happen with kvm_flush_remote_tlbs()?
> >
> > After calling kvm_flush_remote_tlbs(), the page still exists under
> > vcpu->root, so we cannot guarantee that the page is not being used by
> > another vcpu.
>
> Sorry, that should be kvm_reload_remote_mmus() here.
>
> > But before kvm_mmu_commit_zap_page() is called, the page has already
> > been unlinked from the vcpu's page tables, so after
> > kvm_flush_remote_tlbs() we can be sure that other vcpus cannot find
> > these pages.
> >
Ah, I see, so the barrier is needed after the page is unlinked from the
vcpu->root hierarchy.

--
Gleb.
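To summarize the ordering the thread converges on, here is a sketch of
the commit path, modeled on the kvm_mmu_commit_zap_page() of this era
(an illustration rather than the exact upstream function):

/*
 * Sketch: every sp on invalid_list was already unlinked from the
 * shadow page tables by the prepare-zap step, so no new lockless walk
 * can reach it.  The flush below then waits, via IPI, for any vcpu
 * still in READING_SHADOW_PAGE_TABLES; only after that is freeing
 * the backing pages safe.
 */
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
				    struct list_head *invalid_list)
{
	struct kvm_mmu_page *sp, *nsp;

	if (list_empty(invalid_list))
		return;

	/*
	 * Invalidate stale TLB entries and wait for all vcpus to leave
	 * guest mode or finish their lockless shadow page walk.
	 */
	kvm_flush_remote_tlbs(kvm);

	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
		WARN_ON(!sp->role.invalid || sp->root_count);
		kvm_mmu_free_page(sp);	/* freeing is safe only now */
	}
}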
On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
> it flushes the TLB every time it breaks the lock.
>
> We can instead reload the mmu on all vcpus after updating the
> generation number, so that the obsolete pages are not used on any
> vcpu; after that, we no longer need to flush the TLB when obsolete
> pages are zapped.

After that point, is batching also no longer relevant?

Still concerned about a similar case mentioned earlier:

"
Note the account-for-pages-freed step happens after pages are actually
freed: as discussed with Takuya, having pages freed and freed-page
accounting out of sync across mmu_lock is potentially problematic:
kvm->arch.n_used_mmu_pages and friends do not reflect reality, which
can cause problems for SLAB freeing and page allocation throttling.
"

This is a real problem: if you decrease n_used_mmu_pages at
kvm_mmu_prepare_zap_page but only actually free pages later at
kvm_mmu_commit_zap_page, there is the possibility of allowing a huge
number of pages to be retained. There should be a maximum number of
pages on invalid_list.

(The possibility is even higher if you schedule without freeing pages
reported as released!)

> Note: kvm_mmu_commit_zap_page is still needed before freeing the
> pages, since other vcpus may be walking shadow pages locklessly.
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++----------
>  1 files changed, 22 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index e676356..5e34056 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
>  restart:
>  	list_for_each_entry_safe_reverse(sp, node,
>  	      &kvm->arch.active_mmu_pages, link) {
> -		int ret;
> -
>  		/*
>  		 * No obsolete page exists before new created page since
>  		 * active_mmu_pages is the FIFO list.
> @@ -4254,21 +4252,24 @@ restart:
>  		if (sp->role.invalid)
>  			continue;
>
> +		/*
> +		 * Need not flush tlb since we only zap the sp with invalid
> +		 * generation number.
> +		 */
>  		if (batch >= BATCH_ZAP_PAGES &&
> -		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
> +		      cond_resched_lock(&kvm->mmu_lock)) {
>  			batch = 0;
> -			kvm_mmu_commit_zap_page(kvm, &invalid_list);
> -			cond_resched_lock(&kvm->mmu_lock);
>  			goto restart;
>  		}
>
> -		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> -		batch += ret;
> -
> -		if (ret)
> -			goto restart;
> +		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
> +							   &invalid_list);
>  	}
>
> +	/*
> +	 * Should flush tlb before free page tables since lockless-walking
> +	 * may use the pages.
> +	 */
>  	kvm_mmu_commit_zap_page(kvm, &invalid_list);
>  }
>
> @@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
>  	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
>  	kvm->arch.mmu_valid_gen++;
>
> +	/*
> +	 * Notify all vcpus to reload its shadow page table
> +	 * and flush TLB. Then all vcpus will switch to new
> +	 * shadow page table with the new mmu_valid_gen.
> +	 *
> +	 * Note: we should do this under the protection of
> +	 * mmu-lock, otherwise, vcpu would purge shadow page
> +	 * but miss tlb flush.
> +	 */
> +	kvm_reload_remote_mmus(kvm);
> +
>  	kvm_zap_obsolete_pages(kvm);
>  	spin_unlock(&kvm->mmu_lock);
>  }
> --
> 1.7.7.6
On 05/28/2013 08:36 AM, Marcelo Tosatti wrote:
> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
>> it flushes the TLB every time it breaks the lock.
>>
>> We can instead reload the mmu on all vcpus after updating the
>> generation number, so that the obsolete pages are not used on any
>> vcpu; after that, we no longer need to flush the TLB when obsolete
>> pages are zapped.
>
> After that point, is batching also no longer relevant?

No... without batching, we do not know how much time we will spend
zapping pages. That is not good for the case where zap_all_pages is
called in vcpu context.

>
> Still concerned about a similar case mentioned earlier:
>
> "
> Note the account-for-pages-freed step happens after pages are actually
> freed: as discussed with Takuya, having pages freed and freed-page
> accounting out of sync across mmu_lock is potentially problematic:
> kvm->arch.n_used_mmu_pages and friends do not reflect reality, which
> can cause problems for SLAB freeing and page allocation throttling.
> "
>
> This is a real problem: if you decrease n_used_mmu_pages at
> kvm_mmu_prepare_zap_page but only actually free pages later at
> kvm_mmu_commit_zap_page, there is the possibility of allowing a huge
> number of pages to be retained. There should be a maximum number of
> pages on invalid_list.
>
> (The possibility is even higher if you schedule without freeing pages
> reported as released!)
>
>> Note: kvm_mmu_commit_zap_page is still needed before freeing the
>> pages, since other vcpus may be walking shadow pages locklessly.

Ah, yes, I agree with you.

We can introduce a list, say kvm->arch.obsolete_pages, to link all of
the zapped pages; the page shrinker will free the pages on that list
first.

Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you
please let them be merged first, and add comments and the TLB
optimization later?
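The kvm->arch.obsolete_pages list is only proposed here, not
implemented in this series; a hypothetical sketch of the idea might
look like the following (the field, the helper name, and its call site
are all assumptions, not code from the patchset):

/*
 * Hypothetical: a global list in struct kvm_arch that zapped-but-not-
 * yet-freed pages are parked on, instead of a caller-local
 * invalid_list, so the mmu shrinker can always reach and reclaim them
 * even if the zap loop reschedules before committing.
 *
 *	struct list_head obsolete_pages;	// assumed new field
 */

/* Hypothetical shrinker-side helper; caller holds kvm->mmu_lock. */
static int kvm_mmu_free_obsolete_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *nsp;
	int freed = 0;

	if (list_empty(&kvm->arch.obsolete_pages))
		return 0;

	/* One flush for the whole batch, for the lockless walkers. */
	kvm_flush_remote_tlbs(kvm);

	list_for_each_entry_safe(sp, nsp, &kvm->arch.obsolete_pages, link) {
		kvm_mmu_free_page(sp);
		freed++;
	}
	return freed;
}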
On 05/28/2013 11:19 PM, Xiao Guangrong wrote:
> On 05/28/2013 08:36 AM, Marcelo Tosatti wrote:
>> On Thu, May 23, 2013 at 03:55:59AM +0800, Xiao Guangrong wrote:
>>> kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
>>> it flushes the TLB every time it breaks the lock.
>>>
>>> We can instead reload the mmu on all vcpus after updating the
>>> generation number, so that the obsolete pages are not used on any
>>> vcpu; after that, we no longer need to flush the TLB when obsolete
>>> pages are zapped.
>>
>> After that point, is batching also no longer relevant?
>
> No... without batching, we do not know how much time we will spend
> zapping pages. That is not good for the case where zap_all_pages is
> called in vcpu context.
>
>>
>> Still concerned about a similar case mentioned earlier:
>>
>> "
>> Note the account-for-pages-freed step happens after pages are actually
>> freed: as discussed with Takuya, having pages freed and freed-page
>> accounting out of sync across mmu_lock is potentially problematic:
>> kvm->arch.n_used_mmu_pages and friends do not reflect reality, which
>> can cause problems for SLAB freeing and page allocation throttling.
>> "
>>
>> This is a real problem: if you decrease n_used_mmu_pages at
>> kvm_mmu_prepare_zap_page but only actually free pages later at
>> kvm_mmu_commit_zap_page, there is the possibility of allowing a huge
>> number of pages to be retained. There should be a maximum number of
>> pages on invalid_list.
>>
>> (The possibility is even higher if you schedule without freeing pages
>> reported as released!)
>>
>>> Note: kvm_mmu_commit_zap_page is still needed before freeing the
>>> pages, since other vcpus may be walking shadow pages locklessly.
>
> Ah, yes, I agree with you.
>
> We can introduce a list, say kvm->arch.obsolete_pages, to link all of
> the zapped pages; the page shrinker will free the pages on that list
> first.
>
> Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you
> please let them be merged first, and add comments and the TLB
> optimization later?

Exclude patch 11, please, since it depends on the "collapse"
optimization.
On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> >>> pages, since other vcpus may be walking shadow pages locklessly.
> >
> > Ah, yes, I agree with you.
> >
> > We can introduce a list, say kvm->arch.obsolete_pages, to link all of
> > the zapped pages; the page shrinker will free the pages on that list
> > first.
> >
> > Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you
> > please let them be merged first, and add comments and the TLB
> > optimization later?
>
> Exclude patch 11, please, since it depends on the "collapse"
> optimization.

I'm fine with patch 1 being merged. I think the remaining patches need
better understanding or explanation. The problems I see are:

1) The magic number "10" to zap before considering reschedule is
annoying. It would be good to understand why it is needed at all.

But then again, the testcase is measuring kvm_mmu_zap_all performance
alone, which we know is not a common operation, so perhaps there is no
need for that minimum-pages-to-zap-before-reschedule.

2) The problem above (retention of a large number of pages while
zapping) can be fatal: it can lead to OOM and a host crash.

3) Make sure that the introduction of obsolete pages cannot lead to a
huge number of shadow pages hanging around (the correct reason you gave
for not merging https://patchwork.kernel.org/patch/2309641/ no longer
holds with obsolete pages).

Other than these points, I'm fine with the obsolete-pages optimization
to speed up kvm_mmu_zap_all, and with the rest of the patchset.
On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
>>>>> pages, since other vcpus may be walking shadow pages locklessly.
>>>
>>> Ah, yes, I agree with you.
>>>
>>> We can introduce a list, say kvm->arch.obsolete_pages, to link all of
>>> the zapped pages; the page shrinker will free the pages on that list
>>> first.
>>>
>>> Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you
>>> please let them be merged first, and add comments and the TLB
>>> optimization later?
>>
>> Exclude patch 11, please, since it depends on the "collapse"
>> optimization.
>
> I'm fine with patch 1 being merged. I think the remaining patches need
> better understanding or explanation. The problems I see are:
>
> 1) The magic number "10" to zap before considering reschedule is
> annoying. It would be good to understand why it is needed at all.

......

>
> But then again, the testcase is measuring kvm_mmu_zap_all performance
> alone, which we know is not a common operation, so perhaps there is no
> need for that minimum-pages-to-zap-before-reschedule.

Well, although this is not a common operation, it can be triggered by a
VCPU: if one VCPU takes a long time in zap-all-pages, the other vcpus
miss IPI synchronization or miss IO. This easily causes soft lockups if
the vcpu is doing memslot-related things.

>
> 2) The problem above (retention of a large number of pages while
> zapping) can be fatal: it can lead to OOM and a host crash.

This problem is introduced by this patch (patch 10); without it, the
pages are always zapped before the mmu-lock is released.

>
> 3) Make sure that the introduction of obsolete pages cannot lead to a
> huge number of shadow pages hanging around (the correct reason you gave
> for not merging https://patchwork.kernel.org/patch/2309641/ no longer
> holds with obsolete pages).

Actually, this question is the same as 2): it is also the page-reclaim
problem. Without patch #10, there is no problem.

>
> Other than these points, I'm fine with the obsolete-pages optimization
> to speed up kvm_mmu_zap_all, and with the rest of the patchset.

Thank you!
On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> >>>>> pages, since other vcpus may be walking shadow pages locklessly.
> >>>
> >>> Ah, yes, I agree with you.
> >>>
> >>> We can introduce a list, say kvm->arch.obsolete_pages, to link all of
> >>> the zapped pages; the page shrinker will free the pages on that list
> >>> first.
> >>>
> >>> Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you
> >>> please let them be merged first, and add comments and the TLB
> >>> optimization later?
> >>
> >> Exclude patch 11, please, since it depends on the "collapse"
> >> optimization.
> >
> > I'm fine with patch 1 being merged. I think the remaining patches need
> > better understanding or explanation. The problems I see are:
> >
> > 1) The magic number "10" to zap before considering reschedule is
> > annoying. It would be good to understand why it is needed at all.
>
> ......
>
> >
> > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > alone, which we know is not a common operation, so perhaps there is no
> > need for that minimum-pages-to-zap-before-reschedule.
>
> Well, although this is not a common operation, it can be triggered by a
> VCPU: if one VCPU takes a long time in zap-all-pages, the other vcpus
> miss IPI synchronization or miss IO. This easily causes soft lockups if
> the vcpu is doing memslot-related things.
>
+1. If it is triggerable by a guest it may slow down the guest, but we
should not allow it to slow down the host.

--
Gleb.
On Thu, 30 May 2013 03:53:38 +0300
Gleb Natapov <gleb@redhat.com> wrote:

> On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> > On 05/29/2013 08:39 PM, Marcelo Tosatti wrote:
> > > On Wed, May 29, 2013 at 11:03:19AM +0800, Xiao Guangrong wrote:
> > >>>>> pages, since other vcpus may be walking shadow pages locklessly.
> > >>>
> > >>> Ah, yes, I agree with you.
> > >>>
> > >>> We can introduce a list, say kvm->arch.obsolete_pages, to link all of
> > >>> the zapped pages; the page shrinker will free the pages on that list
> > >>> first.
> > >>>
> > >>> Marcelo, if you have no objection to patches 1 ~ 8 and 11, could you
> > >>> please let them be merged first, and add comments and the TLB
> > >>> optimization later?
> > >>
> > >> Exclude patch 11, please, since it depends on the "collapse"
> > >> optimization.
> > >
> > > I'm fine with patch 1 being merged. I think the remaining patches need
> > > better understanding or explanation. The problems I see are:
> > >
> > > 1) The magic number "10" to zap before considering reschedule is
> > > annoying. It would be good to understand why it is needed at all.
> >
> > ......
> >
> > >
> > > But then again, the testcase is measuring kvm_mmu_zap_all performance
> > > alone, which we know is not a common operation, so perhaps there is no
> > > need for that minimum-pages-to-zap-before-reschedule.
> >
> > Well, although this is not a common operation, it can be triggered by a
> > VCPU: if one VCPU takes a long time in zap-all-pages, the other vcpus
> > miss IPI synchronization or miss IO. This easily causes soft lockups if
> > the vcpu is doing memslot-related things.
> >
> +1. If it is triggerable by a guest it may slow down the guest, but we
> should not allow it to slow down the host.
>

Well, I don't object to the minimum-pages-to-zap-before-reschedule idea
itself, but if you're going to take patch 4, please at least add a
warning in the changelog that the magic number "10" was selected
without good enough reasoning.

"[ It improves kernel building 0.6% ~ 1% ]" alone will make it hard for
others to change the number later.

I actually once tried to do a similar thing for other code, so I have a
possible line of reasoning for this; 10 should probably be changed
later.

	Takuya
On Fri, 31 May 2013 01:24:43 +0900
Takuya Yoshikawa <takuya.yoshikawa@gmail.com> wrote:

> On Thu, 30 May 2013 03:53:38 +0300
> Gleb Natapov <gleb@redhat.com> wrote:
>
> > On Wed, May 29, 2013 at 09:19:41PM +0800, Xiao Guangrong wrote:
> > > Well, although this is not a common operation, it can be triggered by a
> > > VCPU: if one VCPU takes a long time in zap-all-pages, the other vcpus
> > > miss IPI synchronization or miss IO. This easily causes soft lockups if
> > > the vcpu is doing memslot-related things.
> > >
> > +1. If it is triggerable by a guest it may slow down the guest, but we
> > should not allow it to slow down the host.
> >
>
> Well, I don't object to the minimum-pages-to-zap-before-reschedule idea
> itself, but if you're going to take patch 4, please at least add a
> warning in the changelog that the magic number "10" was selected
> without good enough reasoning.
>
> "[ It improves kernel building 0.6% ~ 1% ]" alone will make it hard for
> others to change the number later.
>
> I actually once tried to do a similar thing for other code, so I have a
> possible line of reasoning for this; 10 should probably be changed
> later.
>

In this case, the solution seems to be very simple: just drop
spin_needbreak() and leave need_resched() alone.

This way we can guarantee that zap-all gets a fair amount of CPU time
per scheduling, from the host scheduler's point of view. Of course this
can block other VCPU threads waiting for mmu_lock during that time
slice, but it should be much better than blocking them for some magical
number of zappings.

We also need to remember that spin_needbreak() does not do anything for
some preempt config settings.

	Takuya
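A minimal sketch of the zap loop with the lock break Takuya suggests,
yielding only on need_resched() and committing before sleeping so that
prepared pages are never retained across a reschedule (purely
illustrative, with the obsolete-page checks abbreviated):

static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);

restart:
	list_for_each_entry_safe_reverse(sp, node,
	      &kvm->arch.active_mmu_pages, link) {
		if (!is_obsolete_sp(kvm, sp))
			break;

		if (sp->role.invalid)
			continue;

		/*
		 * Yield only when the host scheduler wants the CPU back;
		 * spin_needbreak() is intentionally not consulted.
		 * Commit (flush + free) before sleeping so invalid_list
		 * is empty whenever mmu_lock is dropped.
		 */
		if (need_resched()) {
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			spin_unlock(&kvm->mmu_lock);
			cond_resched();
			spin_lock(&kvm->mmu_lock);
			goto restart;
		}

		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}

	kvm_mmu_commit_zap_page(kvm, &invalid_list);
}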
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e676356..5e34056 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4237,8 +4237,6 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
-		int ret;
-
 		/*
 		 * No obsolete page exists before new created page since
 		 * active_mmu_pages is the FIFO list.
@@ -4254,21 +4252,24 @@ restart:
 		if (sp->role.invalid)
 			continue;
 
+		/*
+		 * Need not flush tlb since we only zap the sp with invalid
+		 * generation number.
+		 */
 		if (batch >= BATCH_ZAP_PAGES &&
-		      (need_resched() || spin_needbreak(&kvm->mmu_lock))) {
+		      cond_resched_lock(&kvm->mmu_lock)) {
 			batch = 0;
-			kvm_mmu_commit_zap_page(kvm, &invalid_list);
-			cond_resched_lock(&kvm->mmu_lock);
 			goto restart;
 		}
 
-		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
-		batch += ret;
-
-		if (ret)
-			goto restart;
+		batch += kvm_mmu_prepare_zap_obsolete_page(kvm, sp,
+							   &invalid_list);
 	}
 
+	/*
+	 * Should flush tlb before free page tables since lockless-walking
+	 * may use the pages.
+	 */
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 }
 
@@ -4287,6 +4288,17 @@ void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm)
 	trace_kvm_mmu_invalidate_zap_all_pages(kvm);
 	kvm->arch.mmu_valid_gen++;
 
+	/*
+	 * Notify all vcpus to reload its shadow page table
+	 * and flush TLB. Then all vcpus will switch to new
+	 * shadow page table with the new mmu_valid_gen.
+	 *
+	 * Note: we should do this under the protection of
+	 * mmu-lock, otherwise, vcpu would purge shadow page
+	 * but miss tlb flush.
+	 */
+	kvm_reload_remote_mmus(kvm);
+
 	kvm_zap_obsolete_pages(kvm);
 	spin_unlock(&kvm->mmu_lock);
 }
kvm_zap_obsolete_pages uses a lock-break technique to zap pages, and
it flushes the TLB every time it breaks the lock.

We can instead reload the mmu on all vcpus after updating the
generation number, so that the obsolete pages are not used on any
vcpu; after that, we no longer need to flush the TLB when obsolete
pages are zapped.

Note: kvm_mmu_commit_zap_page is still needed before freeing the
pages, since other vcpus may be walking shadow pages locklessly.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c | 32 ++++++++++++++++++++++----------
 1 files changed, 22 insertions(+), 10 deletions(-)