
cr3 OOS optimisation breaks 32-bit GNU/kFreeBSD guest

Message ID 20090403214548.GA5394@amt.cnet (mailing list archive)
State Not Applicable

Commit Message

Marcelo Tosatti April 3, 2009, 9:45 p.m. UTC
On Tue, Mar 24, 2009 at 11:47:33AM +0200, Avi Kivity wrote:
>> index 2ea8262..48169d7 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3109,6 +3109,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>>  			kvm_write_guest_time(vcpu);
>>  		if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
>>  			kvm_mmu_sync_roots(vcpu);
>> +		if (test_and_clear_bit(KVM_REQ_MMU_GLOBAL_SYNC, &vcpu->requests))
>> +			kvm_mmu_sync_global(vcpu);
>>  		if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
>>  			kvm_x86_ops->tlb_flush(vcpu);
>>  		if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS
>
> Windows will (I think) write a PDE on every context switch, so this  
> effectively disables global unsync for that guest.
>
> What about recursively syncing the newly linked page in FNAME(fetch)()?  
> If the page isn't global, this becomes a no-op, so no new overhead.  The  
> only question is the expense when linking a populated top-level page,  
> especially in long mode.

How about this?

KVM: MMU: sync global pages on fetch()

If an unsync global page becomes unreachable via the shadow tree, which
can happen if one of its parent pages is zapped, invlpg will fail to
invalidate translations for gvas contained in such unreachable pages.

So sync global pages in fetch().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


Comments

Avi Kivity April 4, 2009, 10:37 a.m. UTC | #1
Marcelo Tosatti wrote:
> On Tue, Mar 24, 2009 at 11:47:33AM +0200, Avi Kivity wrote:
>   
>>> index 2ea8262..48169d7 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3109,6 +3109,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>>>  			kvm_write_guest_time(vcpu);
>>>  		if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
>>>  			kvm_mmu_sync_roots(vcpu);
>>> +		if (test_and_clear_bit(KVM_REQ_MMU_GLOBAL_SYNC, &vcpu->requests))
>>> +			kvm_mmu_sync_global(vcpu);
>>>  		if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
>>>  			kvm_x86_ops->tlb_flush(vcpu);
>>>  		if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS
>>>       
>> Windows will (I think) write a PDE on every context switch, so this  
>> effectively disables global unsync for that guest.
>>
>> What about recursively syncing the newly linked page in FNAME(fetch)()?  
>> If the page isn't global, this becomes a no-op, so no new overhead.  The  
>> only question is the expense when linking a populated top-level page,  
>> especially in long mode.
>>     
>
> How about this?
>
> KVM: MMU: sync global pages on fetch()
>
> If an unsync global page becomes unreachable via the shadow tree, which
> can happen if one of its parent pages is zapped, invlpg will fail to
> invalidate translations for gvas contained in such unreachable pages.
>
> So sync global pages in fetch().
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 09782a9..728be72 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -308,8 +308,14 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>  			break;
>  		}
>  
> -		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> +		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
> +			if (level-1 == PT_PAGE_TABLE_LEVEL) {
> +				shadow_page = page_header(__pa(sptep));
> +				if (shadow_page->unsync && shadow_page->global)
> +					kvm_sync_page(vcpu, shadow_page);
> +			}
>  			continue;
> +		}
>  
>  		if (is_large_pte(*sptep)) {
>  			rmap_remove(vcpu->kvm, sptep);
>   

But here the shadow page is already linked?  Isn't the root cause that 
an invlpg was called when the page wasn't linked, so it wasn't seen by 
invlpg?

So I thought the best place would be in fetch(), after 
kvm_mmu_get_page().  If we're linking a page which contains global ptes, 
they might be unsynced due to invlpgs that we've missed.

Or am I missing something about the root cause?
Marcelo Tosatti April 4, 2009, 5:01 p.m. UTC | #2
On Sat, Apr 04, 2009 at 01:37:39PM +0300, Avi Kivity wrote:
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 09782a9..728be72 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -308,8 +308,14 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>>  			break;
>>  		}
>>
>> -		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>> +		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
>> +			if (level-1 == PT_PAGE_TABLE_LEVEL) {
>> +				shadow_page = page_header(__pa(sptep));
>> +				if (shadow_page->unsync && shadow_page->global)
>> +					kvm_sync_page(vcpu, shadow_page);
>> +			}
>>  			continue;
>> +		}
>>
>>  		if (is_large_pte(*sptep)) {
>>  			rmap_remove(vcpu->kvm, sptep);
>>   
>
> But here the shadow page is already linked?  Isn't the root cause that  
> an invlpg was called when the page wasn't linked, so it wasn't seen by  
> invlpg?
>
> So I thought the best place would be in fetch(), after  
> kvm_mmu_get_page().  If we're linking a page which contains global ptes,  
> they might be unsynced due to invlpgs that we've missed.
>
> Or am I missing something about the root cause?

The problem is when the page is unreachable due to a higher level path
being unlinked. Say:

level 4 -> level 3 . level 2 -> level 1 (global unsync)

The dot there means level 3 is not linked to level 2, so invlpg can't
reach the global unsync at level 1.

kvm_mmu_get_page does sync pages when it finds them, so the code is
already safe for the "linking a page which contains global ptes" case
you mention above.
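
To make that concrete, here is a rough sketch (not the real invlpg walker;
next_sptep() is a made-up stand-in for the shadow walk iterator) of why the
walk cannot reach the unsync page: it gives up at the first non-present
link, so with the level 3 -> level 2 link gone the stale global sptes at
level 1 are never visited.

/*
 * Rough sketch only -- not the actual invlpg code.  next_sptep() is a
 * made-up helper standing in for the shadow walk iterator.
 */
static void sketch_invlpg_walk(struct kvm_vcpu *vcpu, u64 *sptep,
			       int level, gva_t gva)
{
	if (!is_shadow_present_pte(*sptep))
		return;		/* unlinked path: the walk ends here */

	if (level == PT_PAGE_TABLE_LEVEL) {
		/* only here could the stale spte for gva be fixed up */
		return;
	}

	sketch_invlpg_walk(vcpu, next_sptep(*sptep, gva, level - 1),
			   level - 1, gva);
}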

Aurelien Jarno April 4, 2009, 11:23 p.m. UTC | #3
On Fri, Apr 03, 2009 at 06:45:48PM -0300, Marcelo Tosatti wrote:
> On Tue, Mar 24, 2009 at 11:47:33AM +0200, Avi Kivity wrote:
> >> index 2ea8262..48169d7 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -3109,6 +3109,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> >>  			kvm_write_guest_time(vcpu);
> >>  		if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
> >>  			kvm_mmu_sync_roots(vcpu);
> >> +		if (test_and_clear_bit(KVM_REQ_MMU_GLOBAL_SYNC, &vcpu->requests))
> >> +			kvm_mmu_sync_global(vcpu);
> >>  		if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
> >>  			kvm_x86_ops->tlb_flush(vcpu);
> >>  		if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS
> >
> > Windows will (I think) write a PDE on every context switch, so this  
> > effectively disables global unsync for that guest.
> >
> > What about recursively syncing the newly linked page in FNAME(fetch)()?  
> > If the page isn't global, this becomes a no-op, so no new overhead.  The  
> > only question is the expense when linking a populated top-level page,  
> > especially in long mode.
> 
> How about this?
> 
> KVM: MMU: sync global pages on fetch()
> 
> If an unsync global page becomes unreachable via the shadow tree, which
> can happen if one of its parent pages is zapped, invlpg will fail to
> invalidate translations for gvas contained in such unreachable pages.
> 
> So sync global pages in fetch().
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

I have tried this patch, and unfortunately it does not solve the
original problem, while the previous one did.
Avi Kivity April 5, 2009, 8:41 a.m. UTC | #4
Marcelo Tosatti wrote:
> The problem is when the page is unreachable due to a higher level path
> being unlinked. Say:
>
> level 4 -> level 3 . level 2 -> level 1 (global unsync)
>
> The dot there means level 3 is not linked to level 2, so invlpg can't
> reach the global unsync at level 1.
>
> kvm_mmu_get_page does sync pages when it finds them, so the code is
> already safe for the "linking a page which contains global ptes" case
> you mention above.
>   

The recursive resync ignores global pages (or it would hit them on cr3 
switch too).

But we have a bigger problem: invlpg can miss even if nothing is unlinked:

   access address x through global pte
   -> pte instantiated into spte
   switch to cr3 where x is not mapped, or mapped differently
   write to pte
   -> no fault since pte is unsync
   invlpg x
   -> no way this can hit the spte
   switch cr3 back
   access address x
   -> use old spte

Here's one way to make this work:

  - add a hash of global pagetables, indexed by virtual address instead 
of the pagetable's gfn
  - invlpg checks this hash in addition to the recursive walk

We'd need to make the virtual address part of sp->role to avoid needing 
to link the same page multiple times in the virtual address hash.
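
A rough sketch of that idea (nothing below exists in KVM: arch.global_va_hash,
the global_hash_link member, global_va_hash_fn() and gva_covered_by_sp() are
all invented for illustration): invlpg would probe a virtual-address-indexed
hash of global shadow pages in addition to walking from the current root.

/*
 * Illustration of the proposal only -- the hash, its link member and
 * both helpers are invented names, not existing KVM code.
 */
static void invlpg_probe_global_hash(struct kvm_vcpu *vcpu, gva_t gva)
{
	unsigned index = global_va_hash_fn(gva);
	struct hlist_head *bucket = &vcpu->kvm->arch.global_va_hash[index];
	struct kvm_mmu_page *sp;
	struct hlist_node *node;

	hlist_for_each_entry(sp, node, bucket, global_hash_link) {
		if (!sp->global || !sp->unsync)
			continue;
		if (gva_covered_by_sp(sp, gva))
			kvm_sync_page(vcpu, sp);
	}
}

With the virtual address folded into sp->role, a guest page table linked at
two different addresses gets two shadow pages, so each shadow page needs only
a single hash link.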
Marcelo Tosatti April 5, 2009, 11:29 a.m. UTC | #5
On Sun, Apr 05, 2009 at 11:41:39AM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> The problem is when the page is unreachable due to a higher level path
>> being unlinked. Say:
>>
>> level 4 -> level 3 . level 2 -> level 1 (global unsync)
>>
>> The dot there means level 3 is not linked to level 2, so invlpg can't
>> reach the global unsync at level 1.
>>
>> kvm_mmu_get_page does sync pages when it finds them, so the code is
>> already safe for the "linking a page which contains global ptes" case
>> you mention above.
>>   
>
> The recursive resync ignores global pages (or it would hit them on cr3  
> switch too).
>
> But we have a bigger problem: invlpg can miss even if nothing is unlinked:
>
>   access address x through global pte
>   -> pte instantiated into spte
>   switch to cr3 where x is not mapped, or mapped differently
>   write to pte
>   -> no fault since pte is unsync
>   invlpg x
>   -> no way this can hit the spte
>   switch cr3 back
>   access address x
>   -> use old spte
>
> Here's one way to make this work:
>
>  - add a hash of global pagetables, indexed by virtual address instead  
> of the pagetable's gfn
>  - invlpg checks this hash in addition to the recursive walk
>
> We'd need to make the virtual address part of sp->role to avoid needing  
> to link the same page multiple times in the virtual address hash.

Humpf, yes. It seems it's too expensive/complex to handle this, for such
small gain (~= 2% on AIM7 with RHEL3 guest).

Are you okay with just disabling the global pages optimization?

Avi Kivity April 5, 2009, 11:41 a.m. UTC | #6
Marcelo Tosatti wrote:
>> Here's one way to make this work:
>>
>>  - add a hash of global pagetables, indexed by virtual address instead  
>> of the pagetable's gfn
>>  - invlpg checks this hash in addition to the recursive walk
>>
>> We'd need to make the virtual address part of sp->role to avoid needing  
>> to link the same page multiple times in the virtual address hash.
>>     
>
>> Humpf, yes. It seems it's too expensive/complex to handle this, for such
> small gain (~= 2% on AIM7 with RHEL3 guest).
>
> Are you okay with just disabling the global pages optimization?
>   

Definitely to plug the hole; and probably for later as well, unless 
people cry out due to regressions.

Please send it in two patches: a trivial one that disables global page 
detection, which can be sent to -stable as well, and a follow-on that 
rips out the global page machinery until (and if) we decide to 
reimplement it correctly.
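
For reference, a purely hypothetical sketch of what the "trivial" disable
could amount to (this is not the actual patch; neither helper below exists in
KVM, and the call site is invented): gate the point where a shadow page would
be marked global behind a switch that is hard-wired off, so those pages fall
back to the ordinary unsync handling and get resynced on cr3 switch like
everything else.

/*
 * Hypothetical sketch only -- not the actual -stable patch.
 * global_page_optimization_enabled() and mark_page_global() are
 * invented names; only sp->global is a real field.
 */
static inline bool global_page_optimization_enabled(void)
{
	return false;	/* off until invlpg copes with global unsync pages */
}

static void mark_page_global(struct kvm_mmu_page *sp, bool pte_global)
{
	sp->global = global_page_optimization_enabled() && pte_global;
}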

Patch

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 09782a9..728be72 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -308,8 +308,14 @@  static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			break;
 		}
 
-		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
+			if (level-1 == PT_PAGE_TABLE_LEVEL) {
+				shadow_page = page_header(__pa(sptep));
+				if (shadow_page->unsync && shadow_page->global)
+					kvm_sync_page(vcpu, shadow_page);
+			}
 			continue;
+		}
 
 		if (is_large_pte(*sptep)) {
 			rmap_remove(vcpu->kvm, sptep);