
[4/6] kvm: nVMX: support EPT accessed/dirty bits

Message ID 1490867732-16743-5-git-send-email-pbonzini@redhat.com (mailing list archive)
State New, archived

Commit Message

Paolo Bonzini March 30, 2017, 9:55 a.m. UTC
Now use bit 6 of EPTP to optionally enable A/D bits for EPTP.  Another
thing to change is that, when EPT accessed and dirty bits are not in use,
VMX treats accesses to guest paging structures as data reads.  When they
are in use (bit 6 of EPTP is set), they are treated as writes and the
corresponding EPT dirty bit is set.  The MMU didn't know this detail,
so this patch adds it.

We also have to fix up the exit qualification.  It may be wrong because
KVM sets bit 6 but the guest might not.

L1 emulates EPT A/D bits using write permissions, so in principle it may
be possible for EPT A/D bits to be used by L1 even though not available
in hardware.  The problem is that guest page-table walks will be treated
as reads rather than writes, so they would not cause an EPT violation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/include/asm/vmx.h      |  2 ++
 arch/x86/kvm/mmu.c              |  4 +++-
 arch/x86/kvm/mmu.h              |  3 ++-
 arch/x86/kvm/paging_tmpl.h      | 33 ++++++++++++++++-----------------
 arch/x86/kvm/vmx.c              | 33 +++++++++++++++++++++++++++++----
 6 files changed, 55 insertions(+), 25 deletions(-)
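
To make the rule in the commit message concrete, here is a minimal
standalone sketch (illustrative only, not code from the patch; the PFERR_*
values and bit 6 mirror the kernel's page-fault error-code names and
VMX_EPT_AD_ENABLE_BIT) of how the EPTP A/D-enable bit decides the access
type used for guest page-table walks:

#include <stdint.h>

#define EPTP_AD_ENABLE_BIT  (1ULL << 6)  /* mirrors VMX_EPT_AD_ENABLE_BIT */
#define PFERR_WRITE_MASK    (1U << 1)
#define PFERR_USER_MASK     (1U << 2)

/* Access type with which accesses to guest paging structures are checked. */
static unsigned int pt_walk_access(uint64_t eptp)
{
        /* A/D bits in use: the walk sets EPT dirty bits, so treat it as a write. */
        if (eptp & EPTP_AD_ENABLE_BIT)
                return PFERR_USER_MASK | PFERR_WRITE_MASK;

        /* A/D bits not in use: the walk is treated as a data read. */
        return PFERR_USER_MASK;
}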

Comments

Radim Krčmář March 31, 2017, 4:24 p.m. UTC | #1
2017-03-30 11:55+0200, Paolo Bonzini:
> Now use bit 6 of EPTP to optionally enable A/D bits for EPTP.  Another
> thing to change is that, when EPT accessed and dirty bits are not in use,
> VMX treats accesses to guest paging structures as data reads.  When they
> are in use (bit 6 of EPTP is set), they are treated as writes and the
> corresponding EPT dirty bit is set.  The MMU didn't know this detail,
> so this patch adds it.
> 
> We also have to fix up the exit qualification.  It may be wrong because
> KVM sets bit 6 but the guest might not.
> 
> L1 emulates EPT A/D bits using write permissions, so in principle it may
> be possible for EPT A/D bits to be used by L1 even though not available
> in hardware.  The problem is that guest page-table walks will be treated
> as reads rather than writes, so they would not cause an EPT violation.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> @@ -319,6 +310,14 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	ASSERT(!(is_long_mode(vcpu) && !is_pae(vcpu)));
>  
>  	accessed_dirty = have_ad ? PT_GUEST_ACCESSED_MASK : 0;
> +
> +	/*
> +	 * FIXME: on Intel processors, loads of the PDPTE registers for PAE paging
> +	 * by the MOV to CR instruction are treated as reads and do not cause the
> +	 * processor to set the dirty flag in tany EPT paging-structure entry.
                                              ^
                                               typo
> +	 */
> +	nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
> +

This special case should be fairly safe if I understand the consequences
correctly,

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -6211,6 +6213,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> +	if (is_guest_mode(vcpu)
> +	    && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
> +		/*
> +		 * Fix up exit_qualification according to whether guest
> +		 * page table accesses are reads or writes.
> +		 */
> +		u64 eptp = nested_ept_get_cr3(vcpu);
> +		exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
> +		if (eptp & VMX_EPT_AD_ENABLE_BIT)
> +			exit_qualification |= EPT_VIOLATION_ACC_WRITE;

I think this would be better without unconditional clearing

		if (!(eptp & VMX_EPT_AD_ENABLE_BIT))
			exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
Paolo Bonzini March 31, 2017, 4:26 p.m. UTC | #2
On 31/03/2017 18:24, Radim Krčmář wrote:
>> +	if (is_guest_mode(vcpu)
>> +	    && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
>> +		/*
>> +		 * Fix up exit_qualification according to whether guest
>> +		 * page table accesses are reads or writes.
>> +		 */
>> +		u64 eptp = nested_ept_get_cr3(vcpu);
>> +		exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
>> +		if (eptp & VMX_EPT_AD_ENABLE_BIT)
>> +			exit_qualification |= EPT_VIOLATION_ACC_WRITE;
> I think this would be better without unconditional clearing
> 
> 		if (!(eptp & VMX_EPT_AD_ENABLE_BIT))
> 			exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;

Yeah, this is a remnant of my (failed) attempt at emulating A/D bits
when the processor doesn't support them.  It worked, but it's not
compliant enough to include in the final series.

As for the two nits you found, shall I repost or are you okay with
fixing them yourself?

Paolo
Bandan Das April 11, 2017, 11:35 p.m. UTC | #3
Paolo Bonzini <pbonzini@redhat.com> writes:
...
>  	accessed_dirty = have_ad ? PT_GUEST_ACCESSED_MASK : 0;
> +
> +	/*
> +	 * FIXME: on Intel processors, loads of the PDPTE registers for PAE paging
> +	 * by the MOV to CR instruction are treated as reads and do not cause the
> +	 * processor to set the dirty flag in tany EPT paging-structure entry.
> +	 */

Minor typo: "in any EPT paging-structure entry".

> +	nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
> +
>  	pt_access = pte_access = ACC_ALL;
>  	++walker->level;
>  
> @@ -338,7 +337,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  		walker->pte_gpa[walker->level - 1] = pte_gpa;
>  
>  		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> -					      PFERR_USER_MASK|PFERR_WRITE_MASK,
> +					      nested_access,
>  					      &walker->fault);

I can't seem to understand the significance of this change (or, for that
matter, what it was before this change).

mmu->translate_gpa() just returns gfn_to_gpa(table_gfn), right?

Bandan

>  		/*
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 1c372600a962..6aaecc78dd71 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2767,6 +2767,8 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
>  		vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
>  			VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT |
>  			VMX_EPT_1GB_PAGE_BIT;
> +	       if (enable_ept_ad_bits)
> +		       vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT;
>  	} else
>  		vmx->nested.nested_vmx_ept_caps = 0;
>  
> @@ -6211,6 +6213,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>  	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>  
> +	if (is_guest_mode(vcpu)
> +	    && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
> +		/*
> +		 * Fix up exit_qualification according to whether guest
> +		 * page table accesses are reads or writes.
> +		 */
> +		u64 eptp = nested_ept_get_cr3(vcpu);
> +		exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
> +		if (eptp & VMX_EPT_AD_ENABLE_BIT)
> +			exit_qualification |= EPT_VIOLATION_ACC_WRITE;
> +	}
> +
>  	/*
>  	 * EPT violation happened while executing iret from NMI,
>  	 * "blocked by NMI" bit has to be set before next VM entry.
> @@ -9416,17 +9430,26 @@ static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
>  	return get_vmcs12(vcpu)->ept_pointer;
>  }
>  
> -static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
>  {
> +	u64 eptp;
> +
>  	WARN_ON(mmu_is_nested(vcpu));
> +	eptp = nested_ept_get_cr3(vcpu);
> +	if ((eptp & VMX_EPT_AD_ENABLE_BIT) && !enable_ept_ad_bits)
> +		return 1;
> +
> +	kvm_mmu_unload(vcpu);
>  	kvm_init_shadow_ept_mmu(vcpu,
>  			to_vmx(vcpu)->nested.nested_vmx_ept_caps &
> -			VMX_EPT_EXECUTE_ONLY_BIT);
> +			VMX_EPT_EXECUTE_ONLY_BIT,
> +			eptp & VMX_EPT_AD_ENABLE_BIT);
>  	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
>  	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
>  	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
>  
>  	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
> +	return 0;
>  }
>  
>  static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
> @@ -10188,8 +10211,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  	}
>  
>  	if (nested_cpu_has_ept(vmcs12)) {
> -		kvm_mmu_unload(vcpu);
> -		nested_ept_init_mmu_context(vcpu);
> +		if (nested_ept_init_mmu_context(vcpu)) {
> +			*entry_failure_code = ENTRY_FAIL_DEFAULT;
> +			return 1;
> +		}
>  	} else if (nested_cpu_has2(vmcs12,
>  				   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
>  		vmx_flush_tlb_ept_only(vcpu);
Paolo Bonzini April 11, 2017, 11:54 p.m. UTC | #4
----- Original Message -----
> From: "Bandan Das" <bsd@redhat.com>
> To: "Paolo Bonzini" <pbonzini@redhat.com>
> Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, david@redhat.com
> Sent: Wednesday, April 12, 2017 7:35:16 AM
> Subject: Re: [PATCH 4/6] kvm: nVMX: support EPT accessed/dirty bits
> 
> Paolo Bonzini <pbonzini@redhat.com> writes:
> ...
> >  	accessed_dirty = have_ad ? PT_GUEST_ACCESSED_MASK : 0;
> > +
> > +	/*
> > +	 * FIXME: on Intel processors, loads of the PDPTE registers for PAE
> > paging
> > +	 * by the MOV to CR instruction are treated as reads and do not cause the
> > +	 * processor to set the dirty flag in tany EPT paging-structure entry.
> > +	 */
> 
> Minor typo: "in any EPT paging-structure entry".
> 
> > +	nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
> > +
> >  	pt_access = pte_access = ACC_ALL;
> >  	++walker->level;
> >  
> > @@ -338,7 +337,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker
> > *walker,
> >  		walker->pte_gpa[walker->level - 1] = pte_gpa;
> >  
> >  		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> > -					      PFERR_USER_MASK|PFERR_WRITE_MASK,
> > +					      nested_access,
> >  					      &walker->fault);
> 
> I can't seem to understand the significance of this change (or, for that
> matter, what it was before this change).
> 
> mmu->translate_gpa() just returns gfn_to_gpa(table_gfn), right?

For EPT it is; you're right, it's fishy.  The "nested_access" should be
computed in translate_nested_gpa, which is where kvm->arch.nested_mmu
(non-EPT) requests to access kvm->arch.mmu (EPT).

In practice we need to define a new function
vcpu->arch.mmu.gva_to_gpa_nested that computes the nested_access
and calls vcpu->arch.mmu.gva_to_gpa.
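
For reference, the two callbacks involved look roughly like this
(paraphrased from arch/x86/kvm/x86.c around this series, simplified and not
verbatim; the suggested change is only sketched in the comment):

/* Non-nested case: the "translation" is an identity mapping. */
static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
			   struct x86_exception *exception)
{
	return gpa;
}

/* Nested case: the non-EPT nested_mmu asks the EPT mmu for a translation. */
static gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
				  struct x86_exception *exception)
{
	/* Guest page-table walks are always user accesses... */
	access |= PFERR_USER_MASK;

	/*
	 * ...and the idea above would be to also decide here whether the
	 * walk counts as a write (A/D bits will be set), instead of having
	 * FNAME(walk_addr_generic) pass nested_access in.
	 */
	return vcpu->arch.mmu.gva_to_gpa(vcpu, gpa, access, exception);
}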

Thanks,

Paolo
Bandan Das April 12, 2017, 11:02 p.m. UTC | #5
Paolo Bonzini <pbonzini@redhat.com> writes:

> ----- Original Message -----
>> From: "Bandan Das" <bsd@redhat.com>
>> To: "Paolo Bonzini" <pbonzini@redhat.com>
>> Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, david@redhat.com
>> Sent: Wednesday, April 12, 2017 7:35:16 AM
>> Subject: Re: [PATCH 4/6] kvm: nVMX: support EPT accessed/dirty bits
>> 
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>> ...
>> >  	accessed_dirty = have_ad ? PT_GUEST_ACCESSED_MASK : 0;
>> > +
>> > +	/*
>> > +	 * FIXME: on Intel processors, loads of the PDPTE registers for PAE
>> > paging
>> > +	 * by the MOV to CR instruction are treated as reads and do not cause the
>> > +	 * processor to set the dirty flag in tany EPT paging-structure entry.
>> > +	 */
>> 
>> Minor typo: "in any EPT paging-structure entry".
>> 
>> > +	nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
>> > +
>> >  	pt_access = pte_access = ACC_ALL;
>> >  	++walker->level;
>> >  
>> > @@ -338,7 +337,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker
>> > *walker,
>> >  		walker->pte_gpa[walker->level - 1] = pte_gpa;
>> >  
>> >  		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
>> > -					      PFERR_USER_MASK|PFERR_WRITE_MASK,
>> > +					      nested_access,
>> >  					      &walker->fault);
>> 
>> I can't seem to understand the significance of this change (or, for that
>> matter, what it was before this change).
>> 
>> mmu->translate_gpa() just returns gfn_to_gpa(table_gfn), right?
>
> For EPT it is; you're right, it's fishy.  The "nested_access" should be
> computed in translate_nested_gpa, which is where kvm->arch.nested_mmu
> (non-EPT) requests to access kvm->arch.mmu (EPT).

Thanks for the clarification. Is that the case when L1 runs L2 without
EPT? I can't figure out the case where translate_nested_gpa will actually
be called. FNAME(walk_addr_nested) calls walk_addr_generic
with &vcpu->arch.nested_mmu and init_kvm_nested_mmu() sets gva_to_gpa()
with the appropriate "_nested" functions. But the gva_to_gpa() pointers
don't seem to get invoked at all for the nested case.

BTW, just noticed that setting PFERR_USER_MASK is redundant since
translate_nested_gpa does it too.

Bandan

> In practice we need to define a new function
> vcpu->arch.mmu.gva_to_gpa_nested that computes the nested_access
> and calls vcpu->arch.mmu.gva_to_gpa.
>
> Thanks,
>
> Paolo
Paolo Bonzini April 14, 2017, 5:17 a.m. UTC | #6
On 13/04/2017 07:02, Bandan Das wrote:
>> For EPT it is; you're right, it's fishy.  The "nested_access" should be
>> computed in translate_nested_gpa, which is where kvm->arch.nested_mmu
>> (non-EPT) requests to access kvm->arch.mmu (EPT).
>
> Thanks for the clarification. Is that the case when L1 runs L2 without
> EPT? I can't figure out the case where translate_nested_gpa will actually
> be called.

It happens when L2 instructions are emulated by L0, for example when L1
is passing through I/O ports to L2 and L2 runs an "insb" instruction.  I
think this case is not covered by vmx.flat.
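
To make that scenario concrete, the L2 side could be as simple as the
following (hypothetical snippet, not taken from vmx.flat; the port number is
whatever L1 passes through):

#include <stdint.h>

/*
 * A string I/O read from a passed-through port.  When L0 ends up emulating
 * the "rep insb", the emulator writes into the L2 buffer, and translating
 * that L2 virtual address goes through the nested walk -- i.e. through
 * translate_nested_gpa() and the nested_access discussed above.
 */
static inline void l2_insb(uint16_t port, void *buf, uint32_t count)
{
	asm volatile("rep insb"
		     : "+D" (buf), "+c" (count)
		     : "d" (port)
		     : "memory");
}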

Paolo

> FNAME(walk_addr_nested) calls walk_addr_generic
> with &vcpu->arch.nested_mmu and init_kvm_nested_mmu() sets gva_to_gpa()
> with the appropriate "_nested" functions. But the gva_to_gpa() pointers
> don't seem to get invoked at all for the nested case.
> 
> BTW, just noticed that setting PFERR_USER_MASK is redundant since
> translate_nested_gpa does it too.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 74ef58c8ff53..7dbb8d622683 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -343,9 +343,10 @@  struct kvm_mmu {
 	void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			   u64 *spte, const void *pte);
 	hpa_t root_hpa;
-	int root_level;
-	int shadow_root_level;
 	union kvm_mmu_page_role base_role;
+	u8 root_level;
+	u8 shadow_root_level;
+	u8 ept_ad;
 	bool direct_map;
 
 	/*
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index cc54b7026567..dffe8d68fb27 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -516,12 +516,14 @@  struct vmx_msr_entry {
 #define EPT_VIOLATION_READABLE_BIT	3
 #define EPT_VIOLATION_WRITABLE_BIT	4
 #define EPT_VIOLATION_EXECUTABLE_BIT	5
+#define EPT_VIOLATION_GVA_TRANSLATED_BIT 8
 #define EPT_VIOLATION_ACC_READ		(1 << EPT_VIOLATION_ACC_READ_BIT)
 #define EPT_VIOLATION_ACC_WRITE		(1 << EPT_VIOLATION_ACC_WRITE_BIT)
 #define EPT_VIOLATION_ACC_INSTR		(1 << EPT_VIOLATION_ACC_INSTR_BIT)
 #define EPT_VIOLATION_READABLE		(1 << EPT_VIOLATION_READABLE_BIT)
 #define EPT_VIOLATION_WRITABLE		(1 << EPT_VIOLATION_WRITABLE_BIT)
 #define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
+#define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
 
 /*
  * VM-instruction error numbers
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ac7810513d0e..558676538fca 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4340,7 +4340,8 @@  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
-void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly)
+void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
+			     bool accessed_dirty)
 {
 	struct kvm_mmu *context = &vcpu->arch.mmu;
 
@@ -4349,6 +4350,7 @@  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly)
 	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
 
 	context->nx = true;
+	context->ept_ad = accessed_dirty;
 	context->page_fault = ept_page_fault;
 	context->gva_to_gpa = ept_gva_to_gpa;
 	context->sync_page = ept_sync_page;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ddc56e91f2e4..d8ccb32f7308 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -74,7 +74,8 @@  enum {
 
 int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu);
-void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly);
+void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
+			     bool accessed_dirty);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3e20f7b33892..8bf829703a00 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -23,13 +23,6 @@ 
  * so the code in this file is compiled twice, once per pte size.
  */
 
-/*
- * This is used to catch non optimized PT_GUEST_(DIRTY|ACCESS)_SHIFT macro
- * uses for EPT without A/D paging type.
- */
-extern u64 __pure __using_nonexistent_pte_bit(void)
-	       __compiletime_error("wrong use of PT_GUEST_(DIRTY|ACCESS)_SHIFT");
-
 #if PTTYPE == 64
 	#define pt_element_t u64
 	#define guest_walker guest_walker64
@@ -39,8 +32,6 @@  extern u64 __pure __using_nonexistent_pte_bit(void)
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
-	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
-	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
 	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
 	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#define PT_HAVE_ACCESSED_DIRTY(mmu) true
@@ -61,8 +52,6 @@  extern u64 __pure __using_nonexistent_pte_bit(void)
 	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
 	#define PT_MAX_FULL_LEVELS 2
-	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
-	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
 	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
 	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#define PT_HAVE_ACCESSED_DIRTY(mmu) true
@@ -76,17 +65,18 @@  extern u64 __pure __using_nonexistent_pte_bit(void)
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
-	#define PT_GUEST_ACCESSED_MASK 0
-	#define PT_GUEST_DIRTY_MASK 0
-	#define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()
-	#define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
-	#define PT_HAVE_ACCESSED_DIRTY(mmu) false
+	#define PT_GUEST_DIRTY_SHIFT 9
+	#define PT_GUEST_ACCESSED_SHIFT 8
+	#define PT_HAVE_ACCESSED_DIRTY(mmu) ((mmu)->ept_ad)
 	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 4
 #else
 	#error Invalid PTTYPE value
 #endif
 
+#define PT_GUEST_DIRTY_MASK    (1 << PT_GUEST_DIRTY_SHIFT)
+#define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
+
 #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
 #define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PT_PAGE_TABLE_LEVEL)
 
@@ -290,6 +280,7 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	pt_element_t __user *uninitialized_var(ptep_user);
 	gfn_t table_gfn;
 	unsigned index, pt_access, pte_access, accessed_dirty, pte_pkey;
+	unsigned nested_access;
 	gpa_t pte_gpa;
 	bool have_ad;
 	int offset;
@@ -319,6 +310,14 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	ASSERT(!(is_long_mode(vcpu) && !is_pae(vcpu)));
 
 	accessed_dirty = have_ad ? PT_GUEST_ACCESSED_MASK : 0;
+
+	/*
+	 * FIXME: on Intel processors, loads of the PDPTE registers for PAE paging
+	 * by the MOV to CR instruction are treated as reads and do not cause the
+	 * processor to set the dirty flag in tany EPT paging-structure entry.
+	 */
+	nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
+
 	pt_access = pte_access = ACC_ALL;
 	++walker->level;
 
@@ -338,7 +337,7 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
 
 		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
-					      PFERR_USER_MASK|PFERR_WRITE_MASK,
+					      nested_access,
 					      &walker->fault);
 
 		/*
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1c372600a962..6aaecc78dd71 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2767,6 +2767,8 @@  static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
 		vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
 			VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT |
 			VMX_EPT_1GB_PAGE_BIT;
+	       if (enable_ept_ad_bits)
+		       vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT;
 	} else
 		vmx->nested.nested_vmx_ept_caps = 0;
 
@@ -6211,6 +6213,18 @@  static int handle_ept_violation(struct kvm_vcpu *vcpu)
 
 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 
+	if (is_guest_mode(vcpu)
+	    && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
+		/*
+		 * Fix up exit_qualification according to whether guest
+		 * page table accesses are reads or writes.
+		 */
+		u64 eptp = nested_ept_get_cr3(vcpu);
+		exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
+		if (eptp & VMX_EPT_AD_ENABLE_BIT)
+			exit_qualification |= EPT_VIOLATION_ACC_WRITE;
+	}
+
 	/*
 	 * EPT violation happened while executing iret from NMI,
 	 * "blocked by NMI" bit has to be set before next VM entry.
@@ -9416,17 +9430,26 @@  static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
 	return get_vmcs12(vcpu)->ept_pointer;
 }
 
-static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
 {
+	u64 eptp;
+
 	WARN_ON(mmu_is_nested(vcpu));
+	eptp = nested_ept_get_cr3(vcpu);
+	if ((eptp & VMX_EPT_AD_ENABLE_BIT) && !enable_ept_ad_bits)
+		return 1;
+
+	kvm_mmu_unload(vcpu);
 	kvm_init_shadow_ept_mmu(vcpu,
 			to_vmx(vcpu)->nested.nested_vmx_ept_caps &
-			VMX_EPT_EXECUTE_ONLY_BIT);
+			VMX_EPT_EXECUTE_ONLY_BIT,
+			eptp & VMX_EPT_AD_ENABLE_BIT);
 	vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
 	vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
 	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
 
 	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
+	return 0;
 }
 
 static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
@@ -10188,8 +10211,10 @@  static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	}
 
 	if (nested_cpu_has_ept(vmcs12)) {
-		kvm_mmu_unload(vcpu);
-		nested_ept_init_mmu_context(vcpu);
+		if (nested_ept_init_mmu_context(vcpu)) {
+			*entry_failure_code = ENTRY_FAIL_DEFAULT;
+			return 1;
+		}
 	} else if (nested_cpu_has2(vmcs12,
 				   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
 		vmx_flush_tlb_ept_only(vcpu);