
[v4,4/6] KVM: MMU: fast invalid all shadow pages

Message ID 1367032402-13729-5-git-send-email-xiaoguangrong@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong April 27, 2013, 3:13 a.m. UTC
The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
walking and zapping all shadow pages one by one, and it also needs to
zap every guest page's rmap and every shadow page's parent spte list.
Things become particularly bad if the guest uses more memory or vcpus.
It is not good for scalability.

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalidation generation-number which is stored
in kvm->arch.mmu_valid_gen, and every shadow page stores the current
global generation-number into sp->mmu_valid_gen when it is created.

When KVM needs to zap all shadow pages' sptes, it simply increases the
global generation-number and then reloads the root shadow pages on all
vcpus. Each vcpu will then create a new shadow page table based on the
current generation-number. This ensures the old pages are not used any
more.

The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are kept in the mmu-cache until the page allocator reclaims them.

If the invalidation is due to a memslot change, the slot's rmap and
lpage-info will be freed soon; in order to avoid using invalid memory,
we unmap all sptes on its rmap and always reset the lpage-info of all
memslots so that the rmap and lpage-info can be safely freed.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/mmu.c              |   77 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu.h              |    2 +
 3 files changed, 80 insertions(+), 1 deletions(-)
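
In outline, the invalidation flow described above is roughly the following
(an illustrative sketch only, not code from the patch; using
kvm_reload_remote_mmus() as the root-reload mechanism is an assumption
based on how KVM forces vcpus to rebuild their roots):

	static void invalidate_all_shadow_pages_sketch(struct kvm *kvm)
	{
		spin_lock(&kvm->mmu_lock);

		/* Mark every existing shadow page stale in O(1). */
		kvm->arch.mmu_valid_gen++;

		/*
		 * Force each vcpu to drop its current root and rebuild its
		 * shadow page table; pages created from now on are tagged
		 * with the new generation in kvm_mmu_get_page().
		 */
		kvm_reload_remote_mmus(kvm);

		spin_unlock(&kvm->mmu_lock);
	}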

Comments

Marcelo Tosatti May 3, 2013, 1:05 a.m. UTC | #1
On Sat, Apr 27, 2013 at 11:13:20AM +0800, Xiao Guangrong wrote:
> The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
> walking and zapping all shadow pages one by one, and it also needs to
> zap every guest page's rmap and every shadow page's parent spte list.
> Things become particularly bad if the guest uses more memory or vcpus.
> It is not good for scalability.
> 
> In this patch, we introduce a faster way to invalidate all shadow pages.
> KVM maintains a global mmu invalidation generation-number which is stored
> in kvm->arch.mmu_valid_gen, and every shadow page stores the current
> global generation-number into sp->mmu_valid_gen when it is created.
> 
> When KVM needs to zap all shadow pages' sptes, it simply increases the
> global generation-number and then reloads the root shadow pages on all
> vcpus. Each vcpu will then create a new shadow page table based on the
> current generation-number. This ensures the old pages are not used any
> more.
> 
> The invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> are kept in the mmu-cache until the page allocator reclaims them.
> 
> If the invalidation is due to a memslot change, the slot's rmap and
> lpage-info will be freed soon; in order to avoid using invalid memory,
> we unmap all sptes on its rmap and always reset the lpage-info of all
> memslots so that the rmap and lpage-info can be safely freed.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/kvm/mmu.c              |   77 ++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/mmu.h              |    2 +
>  3 files changed, 80 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 18635ae..7adf8f8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -220,6 +220,7 @@ struct kvm_mmu_page {
>  	int root_count;          /* Currently serving as active root */
>  	unsigned int unsync_children;
>  	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
> +	unsigned long mmu_valid_gen;
>  	DECLARE_BITMAP(unsync_child_bitmap, 512);
>  
>  #ifdef CONFIG_X86_32
> @@ -527,6 +528,7 @@ struct kvm_arch {
>  	unsigned int n_requested_mmu_pages;
>  	unsigned int n_max_mmu_pages;
>  	unsigned int indirect_shadow_pages;
> +	unsigned long mmu_valid_gen;
>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>  	/*
>  	 * Hash table of struct kvm_mmu_page.
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 004cc87..63110c7 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1838,6 +1838,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sp);
>  }
>  
> +static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +	return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
> +}
> +
>  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  					     gfn_t gfn,
>  					     gva_t gaddr,
> @@ -1864,6 +1869,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		role.quadrant = quadrant;
>  	}
>  	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> +		if (!is_valid_sp(vcpu->kvm, sp))
> +			continue;
> +
>  		if (!need_sync && sp->unsync)
>  			need_sync = true;
>  
> @@ -1900,6 +1908,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  
>  		account_shadowed(vcpu->kvm, gfn);
>  	}
> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
>  	init_shadow_page_table(sp);
>  	trace_kvm_mmu_get_page(sp, true);
>  	return sp;
> @@ -2070,8 +2079,12 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
>  	kvm_mmu_page_unlink_children(kvm, sp);
>  	kvm_mmu_unlink_parents(kvm, sp);
> -	if (!sp->role.invalid && !sp->role.direct)
> +
> +	if (!sp->role.invalid && !sp->role.direct &&
> +	      /* Invalid-gen pages are not accounted. */
> +	      is_valid_sp(kvm, sp))
>  		unaccount_shadowed(kvm, sp->gfn);
> +
>  	if (sp->unsync)
>  		kvm_unlink_unsync_page(kvm, sp);
>  	if (!sp->root_count) {
> @@ -4194,6 +4207,68 @@ restart:
>  	spin_unlock(&kvm->mmu_lock);
>  }
>  
> +static void
> +memslot_unmap_rmaps(struct kvm_memory_slot *slot, struct kvm *kvm)
> +{
> +	int level;
> +
> +	for (level = PT_PAGE_TABLE_LEVEL;
> +	      level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++level) {
> +		unsigned long idx, *rmapp;
> +
> +		rmapp = slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL];
> +		idx = gfn_to_index(slot->base_gfn + slot->npages - 1,
> +				   slot->base_gfn, level) + 1;
> +
> +		while (idx--) {
> +			kvm_unmap_rmapp(kvm, rmapp + idx, slot, 0);
> +
> +			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> +				cond_resched_lock(&kvm->mmu_lock);
> +		}
> +	}
> +}
> +
> +/*
> + * Fast invalidate all shadow pages belonging to @slot.
> + *
> + * @slot != NULL means the invalidation is caused by the memslot specified
> + * by @slot being deleted; in this case, we should ensure that the rmap
> + * and lpage-info of @slot cannot be used after calling the function.
> + *
> + * @slot == NULL means the invalidation is due to other reasons; we need
> + * not care about rmap and lpage-info since they are still valid after
> + * calling the function.
> + */
> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot)
> +{
> +	spin_lock(&kvm->mmu_lock);
> +	kvm->arch.mmu_valid_gen++;
> +
> +	/*
> +	 * All shadow pages are invalid, so reset the large page info;
> +	 * then we can safely destroy the memslot. It is also good
> +	 * for large page usage.
> +	 */
> +	kvm_clear_all_lpage_info(kvm);

Xiao,

I understood it was agreed that simple mmu_lock lockbreak while
avoiding zapping of newly instantiated pages upon a

	if(spin_needbreak)
		cond_resched_lock()

cycle was enough as a first step? And then later introduce root zapping
along with measurements.

https://lkml.org/lkml/2013/4/22/544
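
The lock-break cycle referred to here is roughly the following pattern
(an illustrative sketch with declarations omitted, not code from the
patch):

	spin_lock(&kvm->mmu_lock);
	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			/* Flush pending zaps before the lock is dropped. */
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
		}
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);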

Takuya Yoshikawa May 3, 2013, 2:27 a.m. UTC | #2
On Sat, 27 Apr 2013 11:13:20 +0800
Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:

> +/*
> + * Fast invalidate all shadow pages belonging to @slot.
> + *
> + * @slot != NULL means the invalidation is caused by the memslot specified
> + * by @slot being deleted; in this case, we should ensure that the rmap
> + * and lpage-info of @slot cannot be used after calling the function.
> + *
> + * @slot == NULL means the invalidation is due to other reasons; we need

The comment should explain what the "other reasons" are.
But this API may be better split into two separate functions; it depends
on the "other reasons".

> + * not care about rmap and lpage-info since they are still valid after
> + * calling the function.
> + */
> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> +				   struct kvm_memory_slot *slot)

You yourself are explaining this as "invalidation" in the comment.
kvm_mmu_invalidate_shadow_pages_memslot() or something...

Can anybody think of a better name?

	Takuya
Xiao Guangrong May 3, 2013, 5:52 a.m. UTC | #3
On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:

>> +
>> +/*
>> + * Fast invalidate all shadow pages belonging to @slot.
>> + *
>> + * @slot != NULL means the invalidation is caused by the memslot specified
>> + * by @slot being deleted; in this case, we should ensure that the rmap
>> + * and lpage-info of @slot cannot be used after calling the function.
>> + *
>> + * @slot == NULL means the invalidation is due to other reasons; we need
>> + * not care about rmap and lpage-info since they are still valid after
>> + * calling the function.
>> + */
>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> +				   struct kvm_memory_slot *slot)
>> +{
>> +	spin_lock(&kvm->mmu_lock);
>> +	kvm->arch.mmu_valid_gen++;
>> +
>> +	/*
>> +	 * All shadow pages are invalid, so reset the large page info;
>> +	 * then we can safely destroy the memslot. It is also good
>> +	 * for large page usage.
>> +	 */
>> +	kvm_clear_all_lpage_info(kvm);
> 
> Xiao,
> 
> I understood it was agreed that simple mmu_lock lockbreak while
> avoiding zapping of newly instantiated pages upon a
> 
> 	if(spin_needbreak)
> 		cond_resched_lock()
> 
> cycle was enough as a first step? And then later introduce root zapping
> along with measurements.
> 
> https://lkml.org/lkml/2013/4/22/544

Yes, it is.

See the changelog in 0/0:

" we use lock-break technique to zap all sptes linked on the
invalid rmap, it is not very effective but good for the first step."

Thanks!


Xiao Guangrong May 3, 2013, 6 a.m. UTC | #4
On 05/03/2013 10:27 AM, Takuya Yoshikawa wrote:
> On Sat, 27 Apr 2013 11:13:20 +0800
> Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> wrote:
> 
>> +/*
>> + * Fast invalidate all shadow pages belonging to @slot.
>> + *
>> + * @slot != NULL means the invalidation is caused by the memslot specified
>> + * by @slot being deleted; in this case, we should ensure that the rmap
>> + * and lpage-info of @slot cannot be used after calling the function.
>> + *
>> + * @slot == NULL means the invalidation is due to other reasons; we need
> 
> The comment should explain what the "other reasons" are.
> But this API may be better split into two separate functions; it depends
> on the "other reasons".

NO.

> 
>> + * not care about rmap and lpage-info since they are still valid after
>> + * calling the function.
>> + */
>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> +				   struct kvm_memory_slot *slot)
> 
> You yourself are explaining this as "invalidation" in the comment.
> kvm_mmu_invalidate_shadow_pages_memslot() or something...

Umm, invalidate is a better name. Will update after collecting Marcelo, Gleb
and the other guys' comments.

Marcelo Tosatti May 3, 2013, 3:53 p.m. UTC | #5
On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> 
> >> +
> >> +/*
> >> + * Fast invalidate all shadow pages belonging to @slot.
> >> + *
> >> + * @slot != NULL means the invalidation is caused by the memslot specified
> >> + * by @slot being deleted; in this case, we should ensure that the rmap
> >> + * and lpage-info of @slot cannot be used after calling the function.
> >> + *
> >> + * @slot == NULL means the invalidation is due to other reasons; we need
> >> + * not care about rmap and lpage-info since they are still valid after
> >> + * calling the function.
> >> + */
> >> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >> +				   struct kvm_memory_slot *slot)
> >> +{
> >> +	spin_lock(&kvm->mmu_lock);
> >> +	kvm->arch.mmu_valid_gen++;
> >> +
> >> +	/*
> >> +	 * All shadow pages are invalid, so reset the large page info;
> >> +	 * then we can safely destroy the memslot. It is also good
> >> +	 * for large page usage.
> >> +	 */
> >> +	kvm_clear_all_lpage_info(kvm);
> > 
> > Xiao,
> > 
> > I understood it was agreed that simple mmu_lock lockbreak while
> > avoiding zapping of newly instantiated pages upon a
> > 
> > 	if(spin_needbreak)
> > 		cond_resched_lock()
> > 
> > cycle was enough as a first step? And then later introduce root zapping
> > along with measurements.
> > 
> > https://lkml.org/lkml/2013/4/22/544
> 
> Yes, it is.
> 
> See the changelog in 0/0:
> 
> " we use lock-break technique to zap all sptes linked on the
> invalid rmap, it is not very effective but good for the first step."
> 
> Thanks!

Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
zapping the root? Only lock-break technique along with generation number 
was what was agreed.

That is, having:

> >> +  /*
> >> +   * All shadow pages are invalid, so reset the large page info;
> >> +   * then we can safely destroy the memslot. It is also good
> >> +   * for large page usage.
> >> +   */
> >> +  kvm_clear_all_lpage_info(kvm);

Was an optimization step that should be done after being shown it is an
advantage?

It is more work, but it leads to a better understanding of the issues in 
practice.

If you have reasons to do it now, then please have it in the final
patches, as an optimization on top of the first patches (where the
lockbreak technique plus generation numbers is introduced).

Xiao Guangrong May 3, 2013, 4:51 p.m. UTC | #6
On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
>>
>>>> +
>>>> +/*
>>>> + * Fast invalidate all shadow pages belonging to @slot.
>>>> + *
>>>> + * @slot != NULL means the invalidation is caused by the memslot specified
>>>> + * by @slot being deleted; in this case, we should ensure that the rmap
>>>> + * and lpage-info of @slot cannot be used after calling the function.
>>>> + *
>>>> + * @slot == NULL means the invalidation is due to other reasons; we need
>>>> + * not care about rmap and lpage-info since they are still valid after
>>>> + * calling the function.
>>>> + */
>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>>>> +				   struct kvm_memory_slot *slot)
>>>> +{
>>>> +	spin_lock(&kvm->mmu_lock);
>>>> +	kvm->arch.mmu_valid_gen++;
>>>> +
>>>> +	/*
>>>> +	 * All shadow pages are invalid, so reset the large page info;
>>>> +	 * then we can safely destroy the memslot. It is also good
>>>> +	 * for large page usage.
>>>> +	 */
>>>> +	kvm_clear_all_lpage_info(kvm);
>>>
>>> Xiao,
>>>
>>> I understood it was agreed that simple mmu_lock lockbreak while
>>> avoiding zapping of newly instantiated pages upon a
>>>
>>> 	if(spin_needbreak)
>>> 		cond_resched_lock()
>>>
>>> cycle was enough as a first step? And then later introduce root zapping
>>> along with measurements.
>>>
>>> https://lkml.org/lkml/2013/4/22/544
>>
>> Yes, it is.
>>
>> See the changelog in 0/0:
>>
>> " we use lock-break technique to zap all sptes linked on the
>> invalid rmap, it is not very effective but good for the first step."
>>
>> Thanks!
> 
> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> zapping the root? Only lock-break technique along with generation number 
> was what was agreed.

Marcelo,

Please Wait... I am completely confused. :(

Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
Are these changes you wanted?

void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
				   struct kvm_memory_slot *slot)
{
	spin_lock(&kvm->mmu_lock);
	kvm->arch.mmu_valid_gen++;

	/* Zero all root pages.*/
restart:
	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
		if (!sp->root_count)
			continue;

		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
			goto restart;
	}

	/*
	 * All shadow pages are invalid, so reset the large page info;
	 * then we can safely destroy the memslot. It is also good
	 * for large page usage.
	 */
	kvm_clear_all_lpage_info(kvm);

	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);
}

static void rmap_remove(struct kvm *kvm, u64 *spte)
{
	struct kvm_mmu_page *sp;
	gfn_t gfn;
	unsigned long *rmapp;

	sp = page_header(__pa(spte));
+
+       /* Do not let an invalid sp access its rmap. */
+	if (!sp_is_valid(sp))
+		return;
+
	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
	rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
	pte_list_remove(spte, rmapp);
}

If yes, there is the reason I mentioned before why we cannot do this:

after kvm_mmu_invalid_memslot_pages() is called, the memslot->rmap will be
destroyed. Later, if the host reclaims a page, the mmu-notify handlers,
->invalidate_page and ->invalidate_range_start, cannot find any spte using
the host page, so Accessed/Dirty tracking for the host page is lost
(kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).
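
The notifier path in question finds sptes only through the memslot's rmap;
roughly, as a sketch of that era's call chain (details illustrative):

	kvm_mmu_notifier_invalidate_page()
	  -> kvm_unmap_hva()
	       -> kvm_handle_hva()	/* with kvm_unmap_rmapp as handler */
		    /*
		     * Looks up the memslot containing the hva and walks that
		     * slot's rmap to find the sptes mapping the host page,
		     * dropping them and propagating Accessed/Dirty state.
		     * If slot->arch.rmap has already been freed, those
		     * sptes can no longer be reached here.
		     */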

What's your idea?

And I should apologize for my poor communication; really sorry for that...


Marcelo Tosatti May 4, 2013, 12:52 a.m. UTC | #7
On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> > On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> >> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> >>
> >>>> +
> >>>> +/*
> >>>> + * Fast invalidate all shadow pages belonging to @slot.
> >>>> + *
> >>>> + * @slot != NULL means the invalidation is caused by the memslot specified
> >>>> + * by @slot being deleted; in this case, we should ensure that the rmap
> >>>> + * and lpage-info of @slot cannot be used after calling the function.
> >>>> + *
> >>>> + * @slot == NULL means the invalidation is due to other reasons; we need
> >>>> + * not care about rmap and lpage-info since they are still valid after
> >>>> + * calling the function.
> >>>> + */
> >>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >>>> +				   struct kvm_memory_slot *slot)
> >>>> +{
> >>>> +	spin_lock(&kvm->mmu_lock);
> >>>> +	kvm->arch.mmu_valid_gen++;
> >>>> +
> >>>> +	/*
> >>>> +	 * All shadow pages are invalid, so reset the large page info;
> >>>> +	 * then we can safely destroy the memslot. It is also good
> >>>> +	 * for large page usage.
> >>>> +	 */
> >>>> +	kvm_clear_all_lpage_info(kvm);
> >>>
> >>> Xiao,
> >>>
> >>> I understood it was agreed that simple mmu_lock lockbreak while
> >>> avoiding zapping of newly instantiated pages upon a
> >>>
> >>> 	if(spin_needbreak)
> >>> 		cond_resched_lock()
> >>>
> >>> cycle was enough as a first step? And then later introduce root zapping
> >>> along with measurements.
> >>>
> >>> https://lkml.org/lkml/2013/4/22/544
> >>
> >> Yes, it is.
> >>
> >> See the changelog in 0/0:
> >>
> >> " we use lock-break technique to zap all sptes linked on the
> >> invalid rmap, it is not very effective but good for the first step."
> >>
> >> Thanks!
> > 
> > Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> > zapping the root? Only lock-break technique along with generation number 
> > was what was agreed.
> 
> Marcelo,
> 
> Please Wait... I am completely confused. :(
> 
> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> Are these changes you wanted?
> 
> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> 				   struct kvm_memory_slot *slot)
> {
> 	spin_lock(&kvm->mmu_lock);
> 	kvm->arch.mmu_valid_gen++;
> 
> 	/* Zero all root pages.*/
> restart:
> 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> 		if (!sp->root_count)
> 			continue;
> 
> 		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> 			goto restart;
> 	}
> 
> 	/*
> 	 * All shadow pages are invalid, so reset the large page info;
> 	 * then we can safely destroy the memslot. It is also good
> 	 * for large page usage.
> 	 */
> 	kvm_clear_all_lpage_info(kvm);
> 
> 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> 	spin_unlock(&kvm->mmu_lock);
> }
> 
> static void rmap_remove(struct kvm *kvm, u64 *spte)
> {
> 	struct kvm_mmu_page *sp;
> 	gfn_t gfn;
> 	unsigned long *rmapp;
> 
> 	sp = page_header(__pa(spte));
> +
> +       /* Do not let an invalid sp access its rmap. */
> +	if (!sp_is_valid(sp))
> +		return;
> +
> 	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> 	rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> 	pte_list_remove(spte, rmapp);
> }
> 
> If yes, there is the reason I mentioned before why we cannot do this:
> 
> after kvm_mmu_invalid_memslot_pages() is called, the memslot->rmap will be
> destroyed. Later, if the host reclaims a page, the mmu-notify handlers,
> ->invalidate_page and ->invalidate_range_start, cannot find any spte using
> the host page, so Accessed/Dirty tracking for the host page is lost
> (kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).
> 
> What's your idea?


Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
releases mmu_lock and reacquires it again, only shadow pages 
from the generation with which kvm_mmu_zap_all started are zapped (this
guarantees forward progress and eventual termination).

kvm_mmu_zap_generation()
	spin_lock(mmu_lock)
	int generation = kvm->arch.mmu_generation;

	for_each_shadow_page(sp) {
		if (sp->generation == kvm->arch.mmu_generation)
			zap_page(sp)
		if (spin_needbreak(mmu_lock)) {
			kvm->arch.mmu_generation++;
			cond_resched_lock(mmu_lock);
		}
	}

kvm_mmu_zap_all()
	spin_lock(mmu_lock)
	for_each_shadow_page(sp) {
		if (spin_needbreak(mmu_lock)) {
			cond_resched_lock(mmu_lock);
		}
	}

Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.

This addresses the main problem: excessively long hold times 
of kvm_mmu_zap_all with very large guests.

Do you see any problem with this logic? This was what i was thinking 
we agreed.

Step 2) Show that the optimization to zap only the roots is worthwhile
via benchmarking, and implement it.

What do you say?

Marcelo Tosatti May 4, 2013, 12:56 a.m. UTC | #8
On Fri, May 03, 2013 at 09:52:01PM -0300, Marcelo Tosatti wrote:
> On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> > On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> > > On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> > >> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> > >>
> > >>>> +
> > >>>> +/*
> > >>>> + * Fast invalidate all shadow pages belonging to @slot.
> > >>>> + *
> > >>>> + * @slot != NULL means the invalidation is caused by the memslot specified
> > >>>> + * by @slot being deleted; in this case, we should ensure that the rmap
> > >>>> + * and lpage-info of @slot cannot be used after calling the function.
> > >>>> + *
> > >>>> + * @slot == NULL means the invalidation is due to other reasons; we need
> > >>>> + * not care about rmap and lpage-info since they are still valid after
> > >>>> + * calling the function.
> > >>>> + */
> > >>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> > >>>> +				   struct kvm_memory_slot *slot)
> > >>>> +{
> > >>>> +	spin_lock(&kvm->mmu_lock);
> > >>>> +	kvm->arch.mmu_valid_gen++;
> > >>>> +
> > >>>> +	/*
> > >>>> +	 * All shadow pages are invalid, so reset the large page info;
> > >>>> +	 * then we can safely destroy the memslot. It is also good
> > >>>> +	 * for large page usage.
> > >>>> +	 */
> > >>>> +	kvm_clear_all_lpage_info(kvm);
> > >>>
> > >>> Xiao,
> > >>>
> > >>> I understood it was agreed that simple mmu_lock lockbreak while
> > >>> avoiding zapping of newly instantiated pages upon a
> > >>>
> > >>> 	if(spin_needbreak)
> > >>> 		cond_resched_lock()
> > >>>
> > >>> cycle was enough as a first step? And then later introduce root zapping
> > >>> along with measurements.
> > >>>
> > >>> https://lkml.org/lkml/2013/4/22/544
> > >>
> > >> Yes, it is.
> > >>
> > >> See the changelog in 0/0:
> > >>
> > >> " we use lock-break technique to zap all sptes linked on the
> > >> invalid rmap, it is not very effective but good for the first step."
> > >>
> > >> Thanks!
> > > 
> > > Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> > > zapping the root? Only lock-break technique along with generation number 
> > > was what was agreed.
> > 
> > Marcelo,
> > 
> > Please Wait... I am completely confused. :(
> > 
> > Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> > Are these changes you wanted?
> > 
> > void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> > 				   struct kvm_memory_slot *slot)
> > {
> > 	spin_lock(&kvm->mmu_lock);
> > 	kvm->arch.mmu_valid_gen++;
> > 
> > 	/* Zero all root pages.*/
> > restart:
> > 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> > 		if (!sp->root_count)
> > 			continue;
> > 
> > 		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> > 			goto restart;
> > 	}
> > 
> > 	/*
> > 	 * All shadow pages are invalid, so reset the large page info;
> > 	 * then we can safely destroy the memslot. It is also good
> > 	 * for large page usage.
> > 	 */
> > 	kvm_clear_all_lpage_info(kvm);
> > 
> > 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > 	spin_unlock(&kvm->mmu_lock);
> > }
> > 
> > static void rmap_remove(struct kvm *kvm, u64 *spte)
> > {
> > 	struct kvm_mmu_page *sp;
> > 	gfn_t gfn;
> > 	unsigned long *rmapp;
> > 
> > 	sp = page_header(__pa(spte));
> > +
> > +       /* Do not let an invalid sp access its rmap. */
> > +	if (!sp_is_valid(sp))
> > +		return;
> > +
> > 	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> > 	rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> > 	pte_list_remove(spte, rmapp);
> > }
> > 
> > If yes, there is the reason I mentioned before why we cannot do this:
> > 
> > after kvm_mmu_invalid_memslot_pages() is called, the memslot->rmap will be
> > destroyed. Later, if the host reclaims a page, the mmu-notify handlers,
> > ->invalidate_page and ->invalidate_range_start, cannot find any spte using
> > the host page, so Accessed/Dirty tracking for the host page is lost
> > (kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).
> > 
> > What's your idea?
> 
> 
> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
> releases mmu_lock and reacquires it again, only shadow pages 
> from the generation with which kvm_mmu_zap_all started are zapped (this
> guarantees forward progress and eventual termination).
> 
> kvm_mmu_zap_generation()
> 	spin_lock(mmu_lock)
> 	int generation = kvm->arch.mmu_generation;
> 
> 	for_each_shadow_page(sp) {
> 		if (sp->generation == kvm->arch.mmu_generation)
> 			zap_page(sp)
> 		if (spin_needbreak(mmu_lock)) {
> 			kvm->arch.mmu_generation++;
> 			cond_resched_lock(mmu_lock);
> 		}
> 	}
> 
> kvm_mmu_zap_all()
> 	spin_lock(mmu_lock)
> 	for_each_shadow_page(sp) {
> 		if (spin_needbreak(mmu_lock)) {
> 			cond_resched_lock(mmu_lock);
> 		}
> 	}
> 
> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> 
> This addresses the main problem: excessively long hold times 
> of kvm_mmu_zap_all with very large guests.
> 
> Do you see any problem with this logic? This was what i was thinking 
> we agreed.
> 
> Step 2) Show that the optimization to zap only the roots is worthwhile
> via benchmarking, and implement it.
> 
> What do you say?

One concern you had earlier was:

"BTW, to my honest, i do not think spin_needbreak is a good way - it
does not fix the hot-lock contention and it just occupies more cpu time
to avoid possible soft lock-ups.

Especially, zap-all-shadow-pages can let other vcpus fault and vcpus
contest mmu-lock, then zap-all-shadow-pages release mmu-lock and wait,
other vcpus create page tables again. zap-all-shadow-page need long
time to be finished, the worst case is, it can not completed forever on
intensive vcpu and memory usage."

But with generation numbers you can guarantee termination (as long as
new pages are added to one side of the active_mmu_pages list while
kvm_mmu_zap_all begins walking from the other).
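
The termination argument is roughly this (an illustrative sketch, assuming
new shadow pages go to the head of the list while the zap walk proceeds
from the tail):

	/* Creation side: new shadow pages are added at the head. */
	list_add(&sp->link, &kvm->arch.active_mmu_pages);

	/*
	 * Zap side: walk from the tail, so pages instantiated after the
	 * walk started (tagged with the new generation) are never
	 * revisited and the walk eventually terminates.
	 */
	list_for_each_entry_safe_reverse(sp, node,
			&kvm->arch.active_mmu_pages, link) {
		/* ... zap sp, with lock-break as above ... */
	}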

Xiao Guangrong May 6, 2013, 3:39 a.m. UTC | #9
On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
> On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
>> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
>>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
>>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
>>>>
>>>>>> +
>>>>>> +/*
>>>>>> + * Fast invalidate all shadow pages belonging to @slot.
>>>>>> + *
>>>>>> + * @slot != NULL means the invalidation is caused by the memslot specified
>>>>>> + * by @slot being deleted; in this case, we should ensure that the rmap
>>>>>> + * and lpage-info of @slot cannot be used after calling the function.
>>>>>> + *
>>>>>> + * @slot == NULL means the invalidation is due to other reasons; we need
>>>>>> + * not care about rmap and lpage-info since they are still valid after
>>>>>> + * calling the function.
>>>>>> + */
>>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>>>>>> +				   struct kvm_memory_slot *slot)
>>>>>> +{
>>>>>> +	spin_lock(&kvm->mmu_lock);
>>>>>> +	kvm->arch.mmu_valid_gen++;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * All shadow pages are invalid, so reset the large page info;
>>>>>> +	 * then we can safely destroy the memslot. It is also good
>>>>>> +	 * for large page usage.
>>>>>> +	 */
>>>>>> +	kvm_clear_all_lpage_info(kvm);
>>>>>
>>>>> Xiao,
>>>>>
>>>>> I understood it was agreed that simple mmu_lock lockbreak while
>>>>> avoiding zapping of newly instantiated pages upon a
>>>>>
>>>>> 	if(spin_needbreak)
>>>>> 		cond_resched_lock()
>>>>>
>>>>> cycle was enough as a first step? And then later introduce root zapping
>>>>> along with measurements.
>>>>>
>>>>> https://lkml.org/lkml/2013/4/22/544
>>>>
>>>> Yes, it is.
>>>>
>>>> See the changelog in 0/0:
>>>>
>>>> " we use lock-break technique to zap all sptes linked on the
>>>> invalid rmap, it is not very effective but good for the first step."
>>>>
>>>> Thanks!
>>>
>>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
>>> zapping the root? Only lock-break technique along with generation number 
>>> was what was agreed.
>>
>> Marcelo,
>>
>> Please Wait... I am completely confused. :(
>>
>> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
>> Are these changes you wanted?
>>
>> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> 				   struct kvm_memory_slot *slot)
>> {
>> 	spin_lock(&kvm->mmu_lock);
>> 	kvm->arch.mmu_valid_gen++;
>>
>> 	/* Zero all root pages.*/
>> restart:
>> 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
>> 		if (!sp->root_count)
>> 			continue;
>>
>> 		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>> 			goto restart;
>> 	}
>>
>> 	/*
>> 	 * All shadow pages are invalid, so reset the large page info;
>> 	 * then we can safely destroy the memslot. It is also good
>> 	 * for large page usage.
>> 	 */
>> 	kvm_clear_all_lpage_info(kvm);
>>
>> 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> 	spin_unlock(&kvm->mmu_lock);
>> }
>>
>> static void rmap_remove(struct kvm *kvm, u64 *spte)
>> {
>> 	struct kvm_mmu_page *sp;
>> 	gfn_t gfn;
>> 	unsigned long *rmapp;
>>
>> 	sp = page_header(__pa(spte));
>> +
>> +       /* Do not let an invalid sp access its rmap. */
>> +	if (!sp_is_valid(sp))
>> +		return;
>> +
>> 	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
>> 	rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
>> 	pte_list_remove(spte, rmapp);
>> }
>>
>> If yes, there is the reason I mentioned before why we cannot do this:
>>
>> after kvm_mmu_invalid_memslot_pages() is called, the memslot->rmap will be
>> destroyed. Later, if the host reclaims a page, the mmu-notify handlers,
>> ->invalidate_page and ->invalidate_range_start, cannot find any spte using
>> the host page, so Accessed/Dirty tracking for the host page is lost
>> (kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).
>>
>> What's your idea?
> 
> 
> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
> releases mmu_lock and reacquires it again, only shadow pages 
> from the generation with which kvm_mmu_zap_all started are zapped (this
> guarantees forward progress and eventual termination).
> 
> kvm_mmu_zap_generation()
> 	spin_lock(mmu_lock)
> 	int generation = kvm->arch.mmu_generation;
> 
> 	for_each_shadow_page(sp) {
> 		if (sp->generation == kvm->arch.mmu_generation)
> 			zap_page(sp)
> 		if (spin_needbreak(mmu_lock)) {
> 			kvm->arch.mmu_generation++;
> 			cond_resched_lock(mmu_lock);
> 		}
> 	}
> 
> kvm_mmu_zap_all()
> 	spin_lock(mmu_lock)
> 	for_each_shadow_page(sp) {
> 		if (spin_needbreak(mmu_lock)) {
> 			cond_resched_lock(mmu_lock);
> 		}
> 	}
> 
> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> 
> This addresses the main problem: excessively long hold times 
> of kvm_mmu_zap_all with very large guests.
> 
> Do you see any problem with this logic? This was what i was thinking 
> we agreed.

No. I understand it and it can work.

Actually, it is similar to Gleb's idea of "zapping stale shadow pages
(using the lock-break technique)"; after some discussion, we thought "only
zap shadow pages that are reachable from the slot's rmap" was better, which
is what this patchset does.
(https://lkml.org/lkml/2013/4/23/73)

> 
> Step 2) Show that the optimization to zap only the roots is worthwhile
> via benchmarking, and implement it.

This is where I am confused. I cannot understand how "zap only the roots"
works. Do you mean this change?

kvm_mmu_zap_generation()
 	spin_lock(mmu_lock)
 	int generation = kvm->arch.mmu_generation;

 	for_each_shadow_page(sp) {
		/* Change here. */
=> 		if ((sp->generation == kvm->arch.mmu_generation) &&
=>		      sp->root_count)
 			zap_page(sp)

 		if (spin_needbreak(mmu_lock)) {
 			kvm->arch.mmu_generation++;
 			cond_resched_lock(mmu_lock);
 		}
 	}

If we do this, there will be shadow pages still linked to the invalid
memslot's rmap. How do we handle these pages and the mmu-notify issue?

Thanks!


Gleb Natapov May 6, 2013, 12:36 p.m. UTC | #10
On Mon, May 06, 2013 at 11:39:11AM +0800, Xiao Guangrong wrote:
> On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
> > On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> >> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> >>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> >>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> >>>>
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * Fast invalidate all shadow pages belonging to @slot.
> >>>>>> + *
> >>>>>> + * @slot != NULL means the invalidation is caused by the memslot specified
> >>>>>> + * by @slot being deleted; in this case, we should ensure that the rmap
> >>>>>> + * and lpage-info of @slot cannot be used after calling the function.
> >>>>>> + *
> >>>>>> + * @slot == NULL means the invalidation is due to other reasons; we need
> >>>>>> + * not care about rmap and lpage-info since they are still valid after
> >>>>>> + * calling the function.
> >>>>>> + */
> >>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >>>>>> +				   struct kvm_memory_slot *slot)
> >>>>>> +{
> >>>>>> +	spin_lock(&kvm->mmu_lock);
> >>>>>> +	kvm->arch.mmu_valid_gen++;
> >>>>>> +
> >>>>>> +	/*
> >>>>>> +	 * All shadow pages are invalid, so reset the large page info;
> >>>>>> +	 * then we can safely destroy the memslot. It is also good
> >>>>>> +	 * for large page usage.
> >>>>>> +	 */
> >>>>>> +	kvm_clear_all_lpage_info(kvm);
> >>>>>
> >>>>> Xiao,
> >>>>>
> >>>>> I understood it was agreed that simple mmu_lock lockbreak while
> >>>>> avoiding zapping of newly instantiated pages upon a
> >>>>>
> >>>>> 	if(spin_needbreak)
> >>>>> 		cond_resched_lock()
> >>>>>
> >>>>> cycle was enough as a first step? And then later introduce root zapping
> >>>>> along with measurements.
> >>>>>
> >>>>> https://lkml.org/lkml/2013/4/22/544
> >>>>
> >>>> Yes, it is.
> >>>>
> >>>> See the changelog in 0/0:
> >>>>
> >>>> " we use lock-break technique to zap all sptes linked on the
> >>>> invalid rmap, it is not very effective but good for the first step."
> >>>>
> >>>> Thanks!
> >>>
> >>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> >>> zapping the root? Only lock-break technique along with generation number 
> >>> was what was agreed.
> >>
> >> Marcelo,
> >>
> >> Please Wait... I am completely confused. :(
> >>
> >> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> >> Are these changes you wanted?
> >>
> >> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >> 				   struct kvm_memory_slot *slot)
> >> {
> >> 	spin_lock(&kvm->mmu_lock);
> >> 	kvm->arch.mmu_valid_gen++;
> >>
> >> 	/* Zero all root pages.*/
> >> restart:
> >> 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> >> 		if (!sp->root_count)
> >> 			continue;
> >>
> >> 		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >> 			goto restart;
> >> 	}
> >>
> >> 	/*
> >> 	 * All shadow pages are invalid, so reset the large page info;
> >> 	 * then we can safely destroy the memslot. It is also good
> >> 	 * for large page usage.
> >> 	 */
> >> 	kvm_clear_all_lpage_info(kvm);
> >>
> >> 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >> 	spin_unlock(&kvm->mmu_lock);
> >> }
> >>
> >> static void rmap_remove(struct kvm *kvm, u64 *spte)
> >> {
> >> 	struct kvm_mmu_page *sp;
> >> 	gfn_t gfn;
> >> 	unsigned long *rmapp;
> >>
> >> 	sp = page_header(__pa(spte));
> >> +
> >> +       /* Do not let an invalid sp access its rmap. */
> >> +	if (!sp_is_valid(sp))
> >> +		return;
> >> +
> >> 	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> >> 	rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> >> 	pte_list_remove(spte, rmapp);
> >> }
> >>
> >> If yes, there is the reason I mentioned before why we cannot do this:
> >>
> >> after kvm_mmu_invalid_memslot_pages() is called, the memslot->rmap will be
> >> destroyed. Later, if the host reclaims a page, the mmu-notify handlers,
> >> ->invalidate_page and ->invalidate_range_start, cannot find any spte using
> >> the host page, so Accessed/Dirty tracking for the host page is lost
> >> (kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).
> >>
> >> What's your idea?
> > 
> > 
> > Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> > spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
> > releases mmu_lock and reacquires it again, only shadow pages 
> > from the generation with which kvm_mmu_zap_all started are zapped (this
> > guarantees forward progress and eventual termination).
> > 
> > kvm_mmu_zap_generation()
> > 	spin_lock(mmu_lock)
> > 	int generation = kvm->arch.mmu_generation;
> > 
> > 	for_each_shadow_page(sp) {
> > 		if (sp->generation == kvm->arch.mmu_generation)
> > 			zap_page(sp)
> > 		if (spin_needbreak(mmu_lock)) {
> > 			kvm->arch.mmu_generation++;
> > 			cond_resched_lock(mmu_lock);
> > 		}
> > 	}
> > 
> > kvm_mmu_zap_all()
> > 	spin_lock(mmu_lock)
> > 	for_each_shadow_page(sp) {
> > 		if (spin_needbreak(mmu_lock)) {
> > 			cond_resched_lock(mmu_lock);
> > 		}
> > 	}
> > 
> > Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> > Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> > 
> > This addresses the main problem: excessively long hold times 
> > of kvm_mmu_zap_all with very large guests.
> > 
> > Do you see any problem with this logic? This was what i was thinking 
> > we agreed.
> 
> No. I understand it and it can work.
> 
> Actually, it is similar to Gleb's idea of "zapping stale shadow pages
> (using the lock-break technique)"; after some discussion, we thought "only
> zap shadow pages that are reachable from the slot's rmap" was better, which
> is what this patchset does.
> (https://lkml.org/lkml/2013/4/23/73)
> 
But this is not what the patch is doing. Close, but not the same :)
Instead of zapping the shadow pages reachable from the slot's rmap, the patch
does kvm_unmap_rmapp(), which drops all sptes without zapping their shadow
pages. That is why you need special code to re-init lpage_info. What I
proposed was to call zap_page() on all shadow pages reachable from the rmap.
This will take care of the lpage_info counters. Does this make sense?
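
In other words, roughly the following per rmap entry (a sketch;
for_each_spte_on_rmap is a hypothetical iterator standing in for the real
rmap walk):

	for_each_spte_on_rmap(rmapp, spte) {	/* hypothetical iterator */
		struct kvm_mmu_page *sp = page_header(__pa(spte));

		/*
		 * Zap the whole shadow page containing the spte; the zap
		 * path runs unaccount_shadowed(), which fixes up the
		 * lpage_info counters.
		 */
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}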

> > 
> > Step 2) Show that the optimization to zap only the roots is worthwhile
> > via benchmarking, and implement it.
> 
> This is where I am confused. I cannot understand how "zap only the roots"
> works. Do you mean this change?
> 
> kvm_mmu_zap_generation()
>  	spin_lock(mmu_lock)
>  	int generation = kvm->arch.mmu_generation;
> 
>  	for_each_shadow_page(sp) {
> 		/* Change here. */
> => 		if ((sp->generation == kvm->arch.mmu_generation) &&
> =>		      sp->root_count)
>  			zap_page(sp)
> 
>  		if (spin_needbreak(mmu_lock)) {
>  			kvm->arch.mmu_generation++;
>  			cond_resched_lock(mmu_lock);
>  		}
>  	}
> 
> If we do this, there will be shadow pages still linked to the invalid
> memslot's rmap. How do we handle these pages and the mmu-notify issue?
> 
> Thanks!
> 

--
			Gleb.
Xiao Guangrong May 6, 2013, 1:10 p.m. UTC | #11
On 05/06/2013 08:36 PM, Gleb Natapov wrote:

>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
>>> releases mmu_lock and reacquires it again, only shadow pages 
>>> from the generation with which kvm_mmu_zap_all started are zapped (this
>>> guarantees forward progress and eventual termination).
>>>
>>> kvm_mmu_zap_generation()
>>> 	spin_lock(mmu_lock)
>>> 	int generation = kvm->arch.mmu_generation;
>>>
>>> 	for_each_shadow_page(sp) {
>>> 		if (sp->generation == kvm->arch.mmu_generation)
>>> 			zap_page(sp)
>>> 		if (spin_needbreak(mmu_lock)) {
>>> 			kvm->arch.mmu_generation++;
>>> 			cond_resched_lock(mmu_lock);
>>> 		}
>>> 	}
>>>
>>> kvm_mmu_zap_all()
>>> 	spin_lock(mmu_lock)
>>> 	for_each_shadow_page(sp) {
>>> 		if (spin_needbreak(mmu_lock)) {
>>> 			cond_resched_lock(mmu_lock);
>>> 		}
>>> 	}
>>>
>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>>>
>>> This addresses the main problem: excessively long hold times 
>>> of kvm_mmu_zap_all with very large guests.
>>>
>>> Do you see any problem with this logic? This was what i was thinking 
>>> we agreed.
>>
>> No. I understand it and it can work.
>>
>> Actually, it is similar to Gleb's idea of "zapping stale shadow pages
>> (using the lock-break technique)"; after some discussion, we thought "only
>> zap shadow pages that are reachable from the slot's rmap" was better, which
>> is what this patchset does.
>> (https://lkml.org/lkml/2013/4/23/73)
>>
> But this is not what the patch is doing. Close, but not the same :)

Okay. :)

> Instead of zapping the shadow pages reachable from the slot's rmap, the patch
> does kvm_unmap_rmapp(), which drops all sptes without zapping their shadow
> pages. That is why you need special code to re-init lpage_info. What I
> proposed was to call zap_page() on all shadow pages reachable from the rmap.
> This will take care of the lpage_info counters. Does this make sense?

Unfortunately, no! We still need to care about lpage_info. lpage_info is used
to count the number of guest page tables in the memslot.

For example, there is a memslot:
memslot[0].base_gfn = 0, memslot[0].npages = 100,

and there is a shadow page:
sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.

this sp is counted in memslot[0], but it cannot be found by walking
memslot[0]->rmap since there is no last-level mapping in this shadow page.
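
The counting happens at shadow-page creation; roughly, per that era's
account_shadowed (an illustrative sketch, with the loop bounds and the
write_count field taken as assumptions):

	/*
	 * A non-direct sp at gfn 10 bumps the disallow-large-page count
	 * for every large-page frame covering gfn 10 in memslot[0], even
	 * though no last-level spte - and therefore no rmap entry -
	 * exists for it in the slot.
	 */
	for (i = PT_DIRECTORY_LEVEL;
	      i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i)
		lpage_info_slot(gfn, slot, i)->write_count += 1;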


Gleb Natapov May 6, 2013, 5:24 p.m. UTC | #12
On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> 
> >>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> >>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
> >>> releases mmu_lock and reacquires it again, only shadow pages 
> >>> from the generation with which kvm_mmu_zap_all started are zapped (this
> >>> guarantees forward progress and eventual termination).
> >>>
> >>> kvm_mmu_zap_generation()
> >>> 	spin_lock(mmu_lock)
> >>> 	int generation = kvm->arch.mmu_generation;
> >>>
> >>> 	for_each_shadow_page(sp) {
> >>> 		if (sp->generation == kvm->arch.mmu_generation)
> >>> 			zap_page(sp)
> >>> 		if (spin_needbreak(mmu_lock)) {
> >>> 			kvm->arch.mmu_generation++;
> >>> 			cond_resched_lock(mmu_lock);
> >>> 		}
> >>> 	}
> >>>
> >>> kvm_mmu_zap_all()
> >>> 	spin_lock(mmu_lock)
> >>> 	for_each_shadow_page(sp) {
> >>> 		if (spin_needbreak(mmu_lock)) {
> >>> 			cond_resched_lock(mmu_lock);
> >>> 		}
> >>> 	}
> >>>
> >>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> >>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> >>>
> >>> This addresses the main problem: excessively long hold times 
> >>> of kvm_mmu_zap_all with very large guests.
> >>>
> >>> Do you see any problem with this logic? This was what i was thinking 
> >>> we agreed.
> >>
> >> No. I understand it and it can work.
> >>
> >> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
> >> (and uses lock break technique)", after some discussion, we thought "only zap
> >> shadow pages that are reachable from the slot's rmap" is better, that is this
> >> patchset does.
> >> (https://lkml.org/lkml/2013/4/23/73)
> >>
> > But this is not what the patch is doing. Close, but not the same :)
> 
> Okay. :)
> 
> > Instead of zapping shadow pages reachable from slot's rmap the patch
> > does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
> > That is why you need special code to re-init lpage_info. What I proposed
> > was to call zap_page() on all shadow pages reachable from rmap. This
> > will take care of lpage_info counters. Does this make sense?
> 
> Unfortunately, no! We still need to care about lpage_info. lpage_info is used
> to count the number of guest page tables in the memslot.
> 
> For example, there is a memslot:
> memslot[0].base_gfn = 0, memslot[0].npages = 100,
> 
> and there is a shadow page:
> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
> 
> this sp is counted in memslot[0], but it cannot be found by walking
> memslot[0]->rmap since there is no last-level mapping in this shadow page.
> 
Right, so what about walking mmu_page_hash for each gfn belonging to the
slot that is in the process of being removed, to find those?
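
That is, roughly (an illustrative sketch reusing the for_each_gfn_sp
iterator seen in the patch):

	for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages; gfn++)
		for_each_gfn_sp(kvm, sp, gfn)
			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);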

--
			Gleb.
Xiao Guangrong May 6, 2013, 5:45 p.m. UTC | #13
On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
>>
>>>>> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
>>>>> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
>>>>> releases mmu_lock and reacquires it again, only shadow pages 
>>>>> from the generation with which kvm_mmu_zap_all started are zapped (this
>>>>> guarantees forward progress and eventual termination).
>>>>>
>>>>> kvm_mmu_zap_generation()
>>>>> 	spin_lock(mmu_lock)
>>>>> 	int generation = kvm->arch.mmu_generation;
>>>>>
>>>>> 	for_each_shadow_page(sp) {
>>>>> 		if (sp->generation == kvm->arch.mmu_generation)
>>>>> 			zap_page(sp)
>>>>> 		if (spin_needbreak(mmu_lock)) {
>>>>> 			kvm->arch.mmu_generation++;
>>>>> 			cond_resched_lock(mmu_lock);
>>>>> 		}
>>>>> 	}
>>>>>
>>>>> kvm_mmu_zap_all()
>>>>> 	spin_lock(mmu_lock)
>>>>> 	for_each_shadow_page(sp) {
>>>>> 		if (spin_needbreak(mmu_lock)) {
>>>>> 			cond_resched_lock(mmu_lock);
>>>>> 		}
>>>>> 	}
>>>>>
>>>>> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
>>>>> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>>>>>
>>>>> This addresses the main problem: excessively long hold times 
>>>>> of kvm_mmu_zap_all with very large guests.
>>>>>
>>>>> Do you see any problem with this logic? This was what i was thinking 
>>>>> we agreed.
>>>>
>>>> No. I understand it and it can work.
>>>>
>>>> Actually, it is similar with Gleb's idea that "zapping stale shadow pages
>>>> (and uses lock break technique)", after some discussion, we thought "only zap
>>>> shadow pages that are reachable from the slot's rmap" is better, that is this
>>>> patchset does.
>>>> (https://lkml.org/lkml/2013/4/23/73)
>>>>
>>> But this is not what the patch is doing. Close, but not the same :)
>>
>> Okay. :)
>>
>>> Instead of zapping shadow pages reachable from slot's rmap the patch
>>> does kvm_unmap_rmapp() which drop all spte without zapping shadow pages.
>>> That is why you need special code to re-init lpage_info. What I proposed
>>> was to call zap_page() on all shadow pages reachable from rmap. This
>>> will take care of lpage_info counters. Does this make sense?
>>
>> Unfortunately, no! We still need to care about lpage_info. lpage_info is used
>> to count the number of guest page tables in the memslot.
>>
>> For example, there is a memslot:
>> memslot[0].base_gfn = 0, memslot[0].npages = 100,
>>
>> and there is a shadow page:
>> sp->role.direct =0, sp->role.level = 4, sp->gfn = 10.
>>
>> this sp is counted in memslot[0], but it cannot be found by walking
>> memslot[0]->rmap since there is no last-level mapping in this shadow page.
>>
> Right, so what about walking mmu_page_hash for each gfn belonging to the
> slot that is in the process of being removed, to find those?

That will cost lots of time. The size of the hash table is 1 << 10, so if
the memslot has 4M of memory, it will walk all of the entries; the cost is
about the same as walking the active_list (maybe a little more). And a
memslot with 4M of memory is the normal case, I think.

Another point is that lpage_info stops the mmu from using large pages. If
we do not reset lpage_info, the mmu keeps using 4K pages until the
invalid sp is zapped.
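
The gating referred to is roughly this check in the fault path (an
illustrative sketch; names as in that era's mmu.c, treated as
assumptions):

	/*
	 * A non-zero count for the gfn at this level forces the fault
	 * path to map with 4K pages instead of installing a large page.
	 */
	if (lpage_info_slot(gfn, slot, level)->write_count)
		return false;	/* large page disallowed */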


Marcelo Tosatti May 6, 2013, 7:50 p.m. UTC | #14
On Mon, May 06, 2013 at 11:39:11AM +0800, Xiao Guangrong wrote:
> On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
> > On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
> >> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
> >>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
> >>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> >>>>
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * Fast invalidate all shadow pages belonging to @slot.
> >>>>>> + *
> >>>>>> + * @slot != NULL means the invalidation is caused by the memslot specified
> >>>>>> + * by @slot being deleted; in this case, we should ensure that the rmap
> >>>>>> + * and lpage-info of @slot cannot be used after calling the function.
> >>>>>> + *
> >>>>>> + * @slot == NULL means the invalidation is due to other reasons; we need
> >>>>>> + * not care about rmap and lpage-info since they are still valid after
> >>>>>> + * calling the function.
> >>>>>> + */
> >>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >>>>>> +				   struct kvm_memory_slot *slot)
> >>>>>> +{
> >>>>>> +	spin_lock(&kvm->mmu_lock);
> >>>>>> +	kvm->arch.mmu_valid_gen++;
> >>>>>> +
> >>>>>> +	/*
> >>>>>> +	 * All shadow pages are invalid, so reset the large page info;
> >>>>>> +	 * then we can safely destroy the memslot. It is also good
> >>>>>> +	 * for large page usage.
> >>>>>> +	 */
> >>>>>> +	kvm_clear_all_lpage_info(kvm);
> >>>>>
> >>>>> Xiao,
> >>>>>
> >>>>> I understood it was agreed that simple mmu_lock lockbreak while
> >>>>> avoiding zapping of newly instantiated pages upon a
> >>>>>
> >>>>> 	if(spin_needbreak)
> >>>>> 		cond_resched_lock()
> >>>>>
> >>>>> cycle was enough as a first step? And then later introduce root zapping
> >>>>> along with measurements.
> >>>>>
> >>>>> https://lkml.org/lkml/2013/4/22/544
> >>>>
> >>>> Yes, it is.
> >>>>
> >>>> See the changelog in 0/0:
> >>>>
> >>>> " we use lock-break technique to zap all sptes linked on the
> >>>> invalid rmap, it is not very effective but good for the first step."
> >>>>
> >>>> Thanks!
> >>>
> >>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
> >>> zapping the root? Only lock-break technique along with generation number 
> >>> was what was agreed.
> >>
> >> Marcelo,
> >>
> >> Please Wait... I am completely confused. :(
> >>
> >> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
> >> Are these changes you wanted?
> >>
> >> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
> >> 				   struct kvm_memory_slot *slot)
> >> {
> >> 	spin_lock(&kvm->mmu_lock);
> >> 	kvm->arch.mmu_valid_gen++;
> >>
> >> 	/* Zero all root pages.*/
> >> restart:
> >> 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> >> 		if (!sp->root_count)
> >> 			continue;
> >>
> >> 		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >> 			goto restart;
> >> 	}
> >>
> >> 	/*
> >> 	 * All shadow pages are invalid, so reset the large page info;
> >> 	 * then we can safely destroy the memslot. It is also good
> >> 	 * for large page usage.
> >> 	 */
> >> 	kvm_clear_all_lpage_info(kvm);
> >>
> >> 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> >> 	spin_unlock(&kvm->mmu_lock);
> >> }
> >>
> >> static void rmap_remove(struct kvm *kvm, u64 *spte)
> >> {
> >> 	struct kvm_mmu_page *sp;
> >> 	gfn_t gfn;
> >> 	unsigned long *rmapp;
> >>
> >> 	sp = page_header(__pa(spte));
> >> +
> >> +       /* Do not let an invalid sp access its rmap. */
> >> +	if (!sp_is_valid(sp))
> >> +		return;
> >> +
> >> 	gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
> >> 	rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
> >> 	pte_list_remove(spte, rmapp);
> >> }
> >>
> >> If yes, there is the reason I mentioned before why we cannot do this:
> >> 
> >> after kvm_mmu_invalid_memslot_pages() is called, the memslot->rmap will be
> >> destroyed. Later, if the host reclaims a page, the mmu-notify handlers,
> >> ->invalidate_page and ->invalidate_range_start, cannot find any spte using
> >> the host page, so Accessed/Dirty tracking for the host page is lost
> >> (kvm_set_pfn_accessed and kvm_set_pfn_dirty are not called properly).
> >>
> >> What's your idea?
> > 
> > 
> > Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> > spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all 
> > releases mmu_lock and reacquires it again, only shadow pages 
> > from the generation with which kvm_mmu_zap_all started are zapped (this
> > guarantees forward progress and eventual termination).
> > 
> > kvm_mmu_zap_generation()
> > 	spin_lock(mmu_lock)
> > 	int generation = kvm->arch.mmu_generation;
> > 
> > 	for_each_shadow_page(sp) {
> > 		if (sp->generation == kvm->arch.mmu_generation)
> > 			zap_page(sp)
> > 		if (spin_needbreak(mmu_lock)) {
> > 			kvm->arch.mmu_generation++;
> > 			cond_resched_lock(mmu_lock);
> > 		}
> > 	}
> > 
> > kvm_mmu_zap_all()
> > 	spin_lock(mmu_lock)
> > 	for_each_shadow_page(sp) {
> > 		if (spin_needbreak(mmu_lock)) {
> > 			cond_resched_lock(mmu_lock);
> > 		}
> > 	}
> > 
> > Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> > Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
> > 
> > This addresses the main problem: excessively long hold times 
> > of kvm_mmu_zap_all with very large guests.
> > 
> > Do you see any problem with this logic? This was what I was thinking
> > we agreed.
> 
> No. I understand it and it can work.
> 
> Actually, it is similar to Gleb's idea of "zapping stale shadow pages
> (using the lock-break technique)"; after some discussion, we thought "only zap
> shadow pages that are reachable from the slot's rmap" is better, which is what
> this patchset does.
> (https://lkml.org/lkml/2013/4/23/73)
> 
> > 
> > Step 2) Show that the optimization to zap only the roots is worthwhile
> > via benchmarking, and implement it.
> 
> This is where I am confused. I cannot understand how "zap only the roots"
> works. Do you mean this change?
> 
> kvm_mmu_zap_generation()
>  	spin_lock(mmu_lock)
>  	int generation = kvm->arch.mmu_generation;
> 
>  	for_each_shadow_page(sp) {
> 		/* Change here. */
> => 		if ((sp->generation == kvm->arch.mmu_generation) &&
> =>		      sp->root_count)
>  			zap_page(sp)
> 
>  		if (spin_needbreak(mmu_lock)) {
>  			kvm->arch.mmu_generation++;
>  			cond_resched_lock(mmu_lock);
>  		}
>  	}
> 
> If we do this, there will be shadow pages still linked to the invalid memslot's
> rmap. How do we handle these pages and the mmu-notify issue?
> 
> Thanks!

By "zap only roots" i mean zapping roots plus generation number on
shadow pages. But this as a second step, after it has been demonstrated
its worthwhile.
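
A minimal sketch of step 1, assuming the helpers this patchset already uses
(kvm_mmu_prepare_zap_page(), kvm_mmu_commit_zap_page(), sp->mmu_valid_gen);
illustrative only, not code from the series:

/*
 * Sketch: zap with lock-break, restricted to the generation the walk
 * started with, so that pages instantiated after a lock-break are left
 * alone and the loop is guaranteed to terminate.
 */
static void kvm_mmu_zap_generation(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);
	unsigned long generation;

	spin_lock(&kvm->mmu_lock);
	generation = kvm->arch.mmu_valid_gen;

restart:
	list_for_each_entry_safe(sp, node,
				 &kvm->arch.active_mmu_pages, link) {
		/* Pages created after a lock-break have a newer generation. */
		if (sp->mmu_valid_gen != generation)
			continue;

		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
			goto restart;

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			/* Flush zapped pages before the lock is dropped. */
			kvm->arch.mmu_valid_gen++;
			kvm_mmu_commit_zap_page(kvm, &invalid_list);
			cond_resched_lock(&kvm->mmu_lock);
			goto restart;
		}
	}

	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	spin_unlock(&kvm->mmu_lock);
}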

Xiao Guangrong May 7, 2013, 3:39 a.m. UTC | #15
On 05/07/2013 03:50 AM, Marcelo Tosatti wrote:
> On Mon, May 06, 2013 at 11:39:11AM +0800, Xiao Guangrong wrote:
>> On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
>>> On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
>>>> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
>>>>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
>>>>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
> [...]
> 
> By "zap only roots" i mean zapping roots plus generation number on
> shadow pages. But this as a second step, after it has been demonstrated
> its worthwhile.

Marcelo,

Sorry for my stupidity, I still do not understand. Could you please show me the
pseudocode and answer my questions above?





Gleb Natapov May 7, 2013, 8:58 a.m. UTC | #16
On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> > On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> >> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> >>
> >>>>> [...]
> >>>>
> >>>> No. I understand it and it can work.
> >>>>
> >>>> Actually, it is similar to Gleb's idea of "zapping stale shadow pages
> >>>> (using the lock-break technique)"; after some discussion, we thought "only zap
> >>>> shadow pages that are reachable from the slot's rmap" is better, which is what
> >>>> this patchset does.
> >>>> (https://lkml.org/lkml/2013/4/23/73)
> >>>>
> >>> But this is not what the patch is doing. Close, but not the same :)
> >>
> >> Okay. :)
> >>
> >>> Instead of zapping shadow pages reachable from the slot's rmap, the patch
> >>> does kvm_unmap_rmapp(), which drops all sptes without zapping shadow pages.
> >>> That is why you need special code to re-init lpage_info. What I proposed
> >>> was to call zap_page() on all shadow pages reachable from the rmap. This
> >>> will take care of the lpage_info counters. Does this make sense?
> >>
> >> Unfortunately, no! We still need to care about lpage_info. lpage_info is
> >> used to count the number of guest page tables in the memslot.
> >>
> >> For example, there is a memslot:
> >> memslot[0].base_gfn = 0, memslot[0].npages = 100,
> >>
> >> and there is a shadow page:
> >> sp->role.direct = 0, sp->role.level = 4, sp->gfn = 10.
> >>
> >> this sp is counted in memslot[0], but it cannot be found by walking
> >> memslot[0]->rmap since there is no last-level mapping in this shadow page.
> >>
> > Right, so what about walking mmu_page_hash for each gfn belonging to the
> > slot that is in the process of being removed to find those?
> 
> That will cost lots of time. The size of the hashtable is 1 << 10. If the
> memslot has 4M of memory, it will walk all the entries; the cost is the same
> as walking the active_list (maybe a little more). And a memslot with 4M of
> memory is the normal case, I think.
> 
Memslots will be much bigger with memory hotplug. Lock break should be
used while walking mmu_page_hash obviously, but still, iterating over the
entire memslot gfn space to find the few gfns that may be there is
suboptimal. We can keep a list of them in the memslot itself.
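
A rough sketch of that idea (the memslot_link and shadow_pages fields below
are hypothetical; nothing in this patchset defines them):

/* Hypothetical fields sketching the "list in the memslot" suggestion: */
struct kvm_mmu_page {
	/* ... existing fields ... */
	struct list_head memslot_link;	/* links the sp into its memslot */
};

struct kvm_arch_memory_slot {
	/* ... existing rmap/lpage_info fields ... */
	struct list_head shadow_pages;	/* sps whose gfn lies in this slot */
};

/* At shadow page creation, e.g. in kvm_mmu_get_page(): */
	slot = gfn_to_memslot(vcpu->kvm, sp->gfn);
	list_add(&sp->memslot_link, &slot->arch.shadow_pages);

/* On slot deletion, walk only the pages that reference the slot: */
	list_for_each_entry_safe(sp, nsp, &slot->arch.shadow_pages, memslot_link)
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);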

> Another point is that lpage_info stops the mmu from using large pages. If
> we do not reset lpage_info, the mmu keeps using 4K pages until the invalid
> sp is zapped.
> 
I do not think this is a big issue. If lpage_info prevented the use of
large pages for some memory ranges before we zapped the shadow pages,
it was probably for a reason, so a new shadow page will prevent large
pages from being created for the same memory ranges.

--
			Gleb.
Xiao Guangrong May 7, 2013, 9:41 a.m. UTC | #17
On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
>> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
>>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
>>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
>>>>
> [...]
> Memslots will be much bigger with memory hotplug. Lock break should be
> used while walking mmu_page_hash obviously, but still, iterating over the
> entire memslot gfn space to find the few gfns that may be there is
> suboptimal. We can keep a list of them in the memslot itself.

It sounds good to me.

BTW, this approach looks more complex and uses more memory (a new list_head
added into every shadow page); why do you dislike clearing lpage_info? ;)
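
(kvm_clear_all_lpage_info() itself is added by an earlier patch of this
series and is not shown on this page; a plausible shape, reusing the index
arithmetic of memslot_unmap_rmaps() from the patch below, would be:)

/* Plausible shape only; the real helper lives in an earlier patch. */
static void kvm_clear_all_lpage_info(struct kvm *kvm)
{
	struct kvm_memory_slot *slot;
	int level;

	kvm_for_each_memslot(slot, kvm->memslots) {
		for (level = PT_DIRECTORY_LEVEL;
		      level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES;
		      ++level) {
			int i = level - PT_DIRECTORY_LEVEL;
			unsigned long nr;

			nr = gfn_to_index(slot->base_gfn + slot->npages - 1,
					  slot->base_gfn, level) + 1;
			/* Zero write_count so large pages may be used again. */
			memset(slot->arch.lpage_info[i], 0,
			       nr * sizeof(*slot->arch.lpage_info[i]));
		}
	}
}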

> 
> > Another point is that lpage_info stops the mmu from using large pages. If
> > we do not reset lpage_info, the mmu keeps using 4K pages until the invalid
> > sp is zapped.
>>
> I do not think this is a big issue. If lpage_info prevented the use of
> large pages for some memory ranges before we zapped the shadow pages,
> it was probably for a reason, so a new shadow page will prevent large
> pages from being created for the same memory ranges.

Still worried, but I will try it if Marcelo does not object.
Thanks a lot for your valuable suggestion, Gleb!

Now, I am trying my best to catch Marcelo's idea of "zapping root
pages", but...



Gleb Natapov May 7, 2013, 10 a.m. UTC | #18
On Tue, May 07, 2013 at 05:41:35PM +0800, Xiao Guangrong wrote:
> On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> > On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> >> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> >>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> >>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> [...]
> > Memslots will be much bigger with memory hotplug. Lock break should be
> > used while walking mmu_page_hash obviously, but still, iterating over the
> > entire memslot gfn space to find the few gfns that may be there is
> > suboptimal. We can keep a list of them in the memslot itself.
> 
> It sounds good to me.
> 
> BTW, this approach looks more complex and uses more memory (a new list_head
> added into every shadow page); why do you dislike clearing lpage_info? ;)
> 
Looks a little bit hackish, but now that I see we do not have an easy way
to find all shadow pages counted in lpage_info, I am not entirely against
it. If you convince Marcelo that clearing lpage_info like that is a good
idea, I may reconsider. I think, regardless of tracking lpage_info,
having a way to find all shadow pages that reference a memslot is a good
thing though.

> [...]
> 
> Still worried, but I will try it if Marcelo does not object.
> Thanks a lot for your valuable suggestion, Gleb!
> 
> Now, I am trying my best to catch Marcelo's idea of "zapping root
> pages", but...
> 
Yes, I am missing what Marcelo means there too. We cannot free the memslot
until we unmap its rmap one way or the other.

--
			Gleb.
Marcelo Tosatti May 7, 2013, 2:33 p.m. UTC | #19
On Tue, May 07, 2013 at 01:00:51PM +0300, Gleb Natapov wrote:
> On Tue, May 07, 2013 at 05:41:35PM +0800, Xiao Guangrong wrote:
> > On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> > > On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> > >> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> > >>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> > >>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> [...]
> Yes, I am missing what Marcelo means there too. We cannot free the memslot
> until we unmap its rmap one way or the other.

I do not understand what you are optimizing for, given the four possible
cases we discussed at

https://lkml.org/lkml/2013/4/18/280

That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.

Marcelo Tosatti May 7, 2013, 2:42 p.m. UTC | #20
On Tue, May 07, 2013 at 11:39:59AM +0800, Xiao Guangrong wrote:
> [...]
> >>>
> >>> Step 2) Show that the optimization to zap only the roots is worthwhile
> >>> via benchmarking, and implement it.
> >>
> >> This is where I am confused. I cannot understand how "zap only the roots"
> >> works. Do you mean this change?
> >>
> >> kvm_mmu_zap_generation()
> >>  	spin_lock(mmu_lock)
> >>  	int generation = kvm->arch.mmu_generation;
> >>
> >>  	for_each_shadow_page(sp) {
> >> 		/* Change here. */
> >> => 		if ((sp->generation == kvm->arch.mmu_generation) &&
> >> =>		      sp->root_count)
> >>  			zap_page(sp)
> >>
> >>  		if (spin_needbreak(mmu_lock)) {
> >>  			kvm->arch.mmu_generation++;
> >>  			cond_resched_lock(mmu_lock);
> >>  		}
> >>  	}
> >>
> >> If we do this, there will be shadow pages still linked to the invalid memslot's
> >> rmap. How do we handle these pages and the mmu-notify issue?

No, this is a full kvm_mmu_zap_page().

In step 2, after demonstrating and understanding kvm_mmu_zap_page()'s inefficiency (which
we are not certain about, given the four use cases of slot
deletion/move/create), use something smarter than plain
kvm_mmu_zap_page.

> >> Thanks!
> > 
> > By "zap only roots" i mean zapping roots plus generation number on
> > shadow pages. But this as a second step, after it has been demonstrated
> > its worthwhile.
> 
> Marcelo,
> 
> Sorry for my stupidity, I still do not understand. Could you please show me the
> pseudocode and answer my questions above?

Hopefully it's clear now?
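
A sketch of that step 2, assembled from the fragments quoted in this thread
(illustrative only, not code from the patchset):

/* "Zap roots + generation bump": only roots are zapped eagerly. */
static void kvm_mmu_invalidate_all_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);

	spin_lock(&kvm->mmu_lock);
	kvm->arch.mmu_valid_gen++;

restart:
	/* Non-root pages become stale by generation and are skipped here. */
	list_for_each_entry_safe(sp, node,
				 &kvm->arch.active_mmu_pages, link) {
		if (!sp->root_count)
			continue;
		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
			goto restart;
	}

	kvm_mmu_commit_zap_page(kvm, &invalid_list);

	/* Force every vcpu onto a root of the new generation. */
	kvm_reload_remote_mmus(kvm);
	spin_unlock(&kvm->mmu_lock);
}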

Gleb Natapov May 7, 2013, 2:56 p.m. UTC | #21
On Tue, May 07, 2013 at 11:33:29AM -0300, Marcelo Tosatti wrote:
> On Tue, May 07, 2013 at 01:00:51PM +0300, Gleb Natapov wrote:
> > On Tue, May 07, 2013 at 05:41:35PM +0800, Xiao Guangrong wrote:
> > > On 05/07/2013 04:58 PM, Gleb Natapov wrote:
> > > > On Tue, May 07, 2013 at 01:45:52AM +0800, Xiao Guangrong wrote:
> > > >> On 05/07/2013 01:24 AM, Gleb Natapov wrote:
> > > >>> On Mon, May 06, 2013 at 09:10:11PM +0800, Xiao Guangrong wrote:
> > > >>>> On 05/06/2013 08:36 PM, Gleb Natapov wrote:
> [...]
> > Yes, I am missing what Marcelo means there too. We cannot free the memslot
> > until we unmap its rmap one way or the other.
> 
> I do not understand what you are optimizing for, given the four possible
> cases we discussed at
> 
> https://lkml.org/lkml/2013/4/18/280
> 
We are optimizing mmu_lock holding time for all of those cases.

But you cannot just "zap roots + sp gen number increase" on slot
deletion, because you need to transfer the access/dirty information from the
rmap that is going to be deleted to the actual page before
kvm_set_memory_region() returns to the caller.
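
(That transfer happens as a side effect of tearing each spte down through
the rmap; condensed and simplified from the 3.9-era spte teardown path, the
mechanism is roughly:)

/* Simplified from the spte teardown path, for illustration only. */
static void drop_spte_track_bits(u64 *sptep)
{
	u64 old_spte = *sptep;
	pfn_t pfn = spte_to_pfn(old_spte);

	mmu_spte_clear_no_track(sptep);		/* remove the mapping */

	/* Push the hardware Accessed/Dirty bits back to the struct page. */
	if (old_spte & shadow_accessed_mask)
		kvm_set_pfn_accessed(pfn);
	if (old_spte & shadow_dirty_mask)
		kvm_set_pfn_dirty(pfn);

	/*
	 * This is exactly what is lost if the slot's rmap is freed before
	 * the sptes hanging off it are visited.
	 */
}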

> That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.
With a lock break? It is. We tried to optimize that by zapping only the pages
that reference the memslot that is going to be deleted and zapping all the
others later when recycling old sps, but if you think this is premature
optimization, I am fine with it.

--
			Gleb.
Marcelo Tosatti May 7, 2013, 3:09 p.m. UTC | #22
On Tue, May 07, 2013 at 05:56:08PM +0300, Gleb Natapov wrote:
> > > Yes, I am missing what Marcelo means there too. We cannot free the memslot
> > > until we unmap its rmap one way or the other.
> > 
> > I do not understand what you are optimizing for, given the four possible
> > cases we discussed at
> > 
> > https://lkml.org/lkml/2013/4/18/280
> > 
> We are optimizing mmu_lock holding time for all of those cases.
> 
> But you cannot just "zap roots + sp gen number increase" on slot
> deletion, because you need to transfer the access/dirty information from the
> rmap that is going to be deleted to the actual page before
> kvm_set_memory_region() returns to the caller.
> 
> > That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.
> With a lock break? It is. We tried to optimize that by zapping only the pages
> that reference the memslot that is going to be deleted and zapping all the
> others later when recycling old sps, but if you think this is premature
> optimization, I am fine with it.

If it can be shown that it's not premature optimization, I am fine with
it.

AFAICS all cases are 1) rare and 2) not latency sensitive (as in there
is no requirement for those cases to finish in a short period of time).

Gleb Natapov May 8, 2013, 10:41 a.m. UTC | #23
On Tue, May 07, 2013 at 12:09:29PM -0300, Marcelo Tosatti wrote:
> On Tue, May 07, 2013 at 05:56:08PM +0300, Gleb Natapov wrote:
> [...]
> 
> If it can be shown that it's not premature optimization, I am fine with
> it.
> 
> AFAICS all cases are 1) rare and 2) not latency sensitive (as in there
> is no requirement for those cases to finish in a short period of time).

OK, let's start from a simple version. The one that goes through the rmap
turned out to be more complicated than we expected.

--
			Gleb.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 18635ae..7adf8f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -220,6 +220,7 @@  struct kvm_mmu_page {
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
 	unsigned long parent_ptes;	/* Reverse mapping for parent_pte */
+	unsigned long mmu_valid_gen;
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
 
 #ifdef CONFIG_X86_32
@@ -527,6 +528,7 @@  struct kvm_arch {
 	unsigned int n_requested_mmu_pages;
 	unsigned int n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
+	unsigned long mmu_valid_gen;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
 	 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 004cc87..63110c7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1838,6 +1838,11 @@  static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sp);
 }
 
+static bool is_valid_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	return likely(sp->mmu_valid_gen == kvm->arch.mmu_valid_gen);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -1864,6 +1869,9 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		role.quadrant = quadrant;
 	}
 	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
+		if (!is_valid_sp(vcpu->kvm, sp))
+			continue;
+
 		if (!need_sync && sp->unsync)
 			need_sync = true;
 
@@ -1900,6 +1908,7 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 		account_shadowed(vcpu->kvm, gfn);
 	}
+	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
 	init_shadow_page_table(sp);
 	trace_kvm_mmu_get_page(sp, true);
 	return sp;
@@ -2070,8 +2079,12 @@  static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
-	if (!sp->role.invalid && !sp->role.direct)
+
+	if (!sp->role.invalid && !sp->role.direct &&
+	      /* Invalid-gen pages are not accounted. */
+	      is_valid_sp(kvm, sp))
 		unaccount_shadowed(kvm, sp->gfn);
+
 	if (sp->unsync)
 		kvm_unlink_unsync_page(kvm, sp);
 	if (!sp->root_count) {
@@ -4194,6 +4207,68 @@  restart:
 	spin_unlock(&kvm->mmu_lock);
 }
 
+static void
+memslot_unmap_rmaps(struct kvm_memory_slot *slot, struct kvm *kvm)
+{
+	int level;
+
+	for (level = PT_PAGE_TABLE_LEVEL;
+	      level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++level) {
+		unsigned long idx, *rmapp;
+
+		rmapp = slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL];
+		idx = gfn_to_index(slot->base_gfn + slot->npages - 1,
+				   slot->base_gfn, level) + 1;
+
+		while (idx--) {
+			kvm_unmap_rmapp(kvm, rmapp + idx, slot, 0);
+
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+}
+
+/*
+ * Fast invalidation of all shadow pages that belong to @slot.
+ *
+ * @slot != NULL means the invalidation is caused by the memslot specified
+ * by @slot being deleted; in this case, we should ensure that the rmap
+ * and lpage-info of the @slot cannot be used after calling the function.
+ *
+ * @slot == NULL means the invalidation is due to other reasons; we need
+ * not care about the rmap and lpage-info since they are still valid after
+ * calling the function.
+ */
+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+				   struct kvm_memory_slot *slot)
+{
+	spin_lock(&kvm->mmu_lock);
+	kvm->arch.mmu_valid_gen++;
+
+	/*
+	 * All shadow pages are invalid, so reset the large page info;
+	 * then we can safely destroy the memslot. It is also good
+	 * for large page usage.
+	 */
+	kvm_clear_all_lpage_info(kvm);
+
+	/*
+	 * Notify all vcpus to reload their shadow page tables
+	 * and flush the TLB. Then all vcpus will switch to the new
+	 * shadow page table with the new mmu_valid_gen.
+	 *
+	 * Note: we should do this under the protection of
+	 * mmu-lock; otherwise, a vcpu could purge shadow pages
+	 * but miss the TLB flush.
+	 */
+	kvm_reload_remote_mmus(kvm);
+
+	if (slot)
+		memslot_unmap_rmaps(slot, kvm);
+	spin_unlock(&kvm->mmu_lock);
+}
+
 void kvm_mmu_zap_mmio_sptes(struct kvm *kvm)
 {
 	struct kvm_mmu_page *sp, *node;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2adcbc2..94670f0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -97,4 +97,6 @@  static inline bool permission_fault(struct kvm_mmu *mmu, unsigned pte_access,
 	return (mmu->permissions[pfec >> 1] >> pte_access) & 1;
 }
 
+void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
+				   struct kvm_memory_slot *slot);
 #endif
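
For context: the caller-side wiring of kvm_mmu_invalid_memslot_pages()
belongs to later patches of this series and is not shown on this page. A
plausible sketch for arch/x86/kvm/x86.c, following the @slot convention
documented above, is:

/* Plausible caller-side wiring; the real hookup is in a later patch. */
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
				   struct kvm_memory_slot *slot)
{
	/* The slot is going away: also unmap its rmaps (@slot != NULL). */
	kvm_mmu_invalid_memslot_pages(kvm, slot);
}

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
	/* No memslot is dying: rmap and lpage-info stay valid (@slot == NULL). */
	kvm_mmu_invalid_memslot_pages(kvm, NULL);
}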