[2/2] KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'

Message ID 20190307232744.29339-3-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Series: KVM: x86: fix handling of guest PTE sizes

Commit Message

Sean Christopherson March 7, 2019, 11:27 p.m. UTC
The cr4_pae flag is a bit of a misnomer; its real purpose is to track
whether the guest PTE being shadowed is a 4-byte entry or an 8-byte
entry.  Prior to supporting nested EPT, the size of the gpte was
reflected purely by CR4.PAE.  KVM fudged things a bit for direct sptes,
but it was mostly harmless since the size of the gpte never mattered.
Now that a spte may be tracking an indirect EPT entry, relying on
CR4.PAE is wrong, and the name itself is misleading.

For direct shadow pages, force gpte_size to '1' as they are always
8-byte entries; EPT entries can only be 8 bytes, and KVM always uses
8-byte entries for NPT and its identity map (when running with EPT but
not unrestricted guest).

Likewise, nested EPT entries are always 8 bytes.  Nested EPT presents a
unique scenario as the size of the entries is not dictated by CR4.PAE,
but neither is the shadow page a direct map.  To handle this scenario,
set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
to denote a nested EPT shadow page.  Use that information to avoid
incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
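
For illustration, a minimal standalone sketch of this encoding (simplified
types rather than KVM's union kvm_mmu_page_role; only the is_ept_sp() check
and the bit names mirror the patch below).  The combination is impossible for
a genuine role because smap_andnot_wp is computed as cr4.smap && !cr0.wp, so
it can never be set while cr0_wp is also set:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for union kvm_mmu_page_role; illustration only. */
struct role {
	unsigned gpte_size:1;		/* 1 = 8-byte gptes, 0 = 4-byte gptes */
	unsigned cr0_wp:1;
	unsigned smap_andnot_wp:1;	/* normally cr4.smap && !cr0.wp */
};

/* Mirrors the helper added by the patch: WP=1 && NOT_WP=1 can only mean a
 * shadowed nested EPT page. */
static bool is_ept_sp(const struct role *role)
{
	return role->cr0_wp && role->smap_andnot_wp;
}

int main(void)
{
	bool cr0_wp = true, cr4_smap = true;

	/* A role derived from real guest state can never set both bits. */
	struct role normal = {
		.gpte_size = 1,
		.cr0_wp = cr0_wp,
		.smap_andnot_wp = cr4_smap && !cr0_wp,	/* 0 whenever cr0_wp=1 */
	};

	/* A shadowed nested EPT page deliberately forces both bits to 1. */
	struct role ept_sp = { .gpte_size = 1, .cr0_wp = 1, .smap_andnot_wp = 1 };

	printf("normal role: is_ept_sp=%d\n", is_ept_sp(&normal));	/* 0 */
	printf("nested EPT:  is_ept_sp=%d\n", is_ept_sp(&ept_sp));	/* 1 */
	return 0;
}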

Providing a consistent and accurate gpte_size fixes a bug reported by
Vitaly where fast_cr3_switch() always fails when switching from L2 to
L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
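
As an illustration of that failure mode, a minimal sketch with a simplified
role layout and made-up helper names (not the actual KVM code paths): the
cached-root reuse behind fast_cr3_switch() compares the full role word, so a
single bit computed differently on the two paths makes the comparison fail
every time:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for union kvm_mmu_page_role; illustration only. */
union role {
	unsigned word;
	struct {
		unsigned level:4;
		unsigned gpte_size:1;	/* pre-patch: cr4_pae */
		unsigned direct:1;
	};
};

/* Hypothetical stand-in for the role a direct root page was created with:
 * pre-patch, kvm_mmu_get_page() forced the PTE-size bit to 0 for direct pages. */
static union role role_of_direct_root(bool cr4_pae)
{
	union role role = { .word = 0 };

	role.level = 4;
	role.direct = 1;
	role.gpte_size = 0;	/* forced, regardless of CR4.PAE */
	(void)cr4_pae;
	return role;
}

/* Hypothetical stand-in for the role used when searching for a cached root:
 * pre-patch, kvm_calc_mmu_role_common() set the same bit from CR4.PAE. */
static union role role_for_cached_lookup(bool cr4_pae)
{
	union role role = { .word = 0 };

	role.level = 4;
	role.direct = 1;
	role.gpte_size = cr4_pae;
	return role;
}

int main(void)
{
	bool cr4_pae = true;	/* e.g. L1 with PAE/64-bit paging enabled */
	union role created = role_of_direct_root(cr4_pae);
	union role wanted = role_for_cached_lookup(cr4_pae);

	/* One mismatched bit and the cached root is never considered a match,
	 * so the fast CR3 switch falls back to a full root reload. */
	printf("role words match: %d\n", created.word == wanted.word);	/* 0 */
	return 0;
}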

Fixes: 7dcd575520082 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 Documentation/virtual/kvm/mmu.txt | 11 +++++----
 arch/x86/include/asm/kvm_host.h   |  4 ++--
 arch/x86/kvm/mmu.c                | 38 +++++++++++++++++++------------
 arch/x86/kvm/mmutrace.h           |  4 ++--
 4 files changed, 35 insertions(+), 22 deletions(-)

Comments

Vitaly Kuznetsov March 11, 2019, 11:56 a.m. UTC | #1
Sean Christopherson <sean.j.christopherson@intel.com> writes:

> The cr4_pae flag is a bit of a misnomer; its real purpose is to track
> whether the guest PTE being shadowed is a 4-byte entry or an 8-byte
> entry.  Prior to supporting nested EPT, the size of the gpte was
> reflected purely by CR4.PAE.  KVM fudged things a bit for direct sptes,
> but it was mostly harmless since the size of the gpte never mattered.
> Now that a spte may be tracking an indirect EPT entry, relying on
> CR4.PAE is wrong, and the name itself is misleading.
>
> For direct shadow pages, force gpte_size to '1' as they are always
> 8-byte entries; EPT entries can only be 8 bytes, and KVM always uses
> 8-byte entries for NPT and its identity map (when running with EPT but
> not unrestricted guest).
>
> Likewise, nested EPT entries are always 8 bytes.  Nested EPT presents a
> unique scenario as the size of the entries is not dictated by CR4.PAE,
> but neither is the shadow page a direct map.  To handle this scenario,
> set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
> to denote a nested EPT shadow page.  Use that information to avoid
> incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
>
> Providing a consistent and accurate gpte_size fixes a bug reported by
> Vitaly where fast_cr3_switch() always fails when switching from L2 to
> L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
> whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
>
> Fixes: 7dcd575520082 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
> Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>

Thank you for working on this!

> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  Documentation/virtual/kvm/mmu.txt | 11 +++++----
>  arch/x86/include/asm/kvm_host.h   |  4 ++--
>  arch/x86/kvm/mmu.c                | 38 +++++++++++++++++++------------
>  arch/x86/kvm/mmutrace.h           |  4 ++--
>  4 files changed, 35 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index f365102c80f5..2747ad774a38 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -142,7 +142,7 @@ Shadow pages contain the following information:
>      If clear, this page corresponds to a guest page table denoted by the gfn
>      field.
>    role.quadrant:
> -    When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
> +    When role.gpte_size=0, the guest uses 32-bit gptes while the host uses 64-bit
>      sptes.  That means a guest page table contains more ptes than the host,
>      so multiple shadow pages are needed to shadow one guest page.
>      For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
> @@ -158,9 +158,9 @@ Shadow pages contain the following information:
>      The page is invalid and should not be used.  It is a root page that is
>      currently pinned (by a cpu hardware register pointing to it); once it is
>      unpinned it will be destroyed.
> -  role.cr4_pae:
> -    Contains the value of cr4.pae for which the page is valid (e.g. whether
> -    32-bit or 64-bit gptes are in use).
> +  role.gpte_size:
> +    Reflects the size of the guest PTE for which the page is valid, i.e. '1'
> +    if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
>    role.nxe:
>      Contains the value of efer.nxe for which the page is valid.
>    role.cr0_wp:
> @@ -173,6 +173,9 @@ Shadow pages contain the following information:
>      Contains the value of cr4.smap && !cr0.wp for which the page is valid
>      (pages for which this is true are different from other pages; see the
>      treatment of cr0.wp=0 below).
> +  role.ept_sp:
> +    This is a virtual flag to denote a shadowed nested EPT page.  ept_sp
> +    is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
>    role.smm:
>      Is 1 if the page is valid in system management mode.  This field
>      determines which of the kvm_memslots array was used to build this
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7712b4ed8aa1..fdf67d33f6b3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -253,14 +253,14 @@ struct kvm_mmu_memory_cache {
>   * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used
>   * by indirect shadow page can not be more than 15 bits.
>   *
> - * Currently, we used 14 bits that are @level, @cr4_pae, @quadrant, @access,
> + * Currently, we used 14 bits that are @level, @gpte_size, @quadrant, @access,
>   * @nxe, @cr0_wp, @smep_andnot_wp and @smap_andnot_wp.
>   */
>  union kvm_mmu_page_role {
>  	u32 word;
>  	struct {
>  		unsigned level:4;
> -		unsigned cr4_pae:1;
> +		unsigned gpte_size:1;

This is probably just a matter of taste but I would prefer it to be
renamed to something like "gpte_is_64": when I see

role.gpte_size = true

I have to go and read the definition of gpte_size, whereas

role.gpte_is_64 = true

reads more naturally. 

>  		unsigned quadrant:2;
>  		unsigned direct:1;
>  		unsigned access:3;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 71098f2ae870..2a32a14efd98 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -182,7 +182,7 @@ struct kvm_shadow_walk_iterator {
>  
>  static const union kvm_mmu_page_role mmu_base_role_mask = {
>  	.cr0_wp = 1,
> -	.cr4_pae = 1,
> +	.gpte_size = 1,
>  	.nxe = 1,
>  	.smep_andnot_wp = 1,
>  	.smap_andnot_wp = 1,
> @@ -2205,6 +2205,7 @@ static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>  				    struct list_head *invalid_list);
>  
> +

stray blank line

>  #define for_each_valid_sp(_kvm, _sp, _gfn)				\
>  	hlist_for_each_entry(_sp,					\
>  	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
> @@ -2215,12 +2216,17 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>  	for_each_valid_sp(_kvm, _sp, _gfn)				\
>  		if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
>  
> +static inline bool is_ept_sp(struct kvm_mmu_page *sp)
> +{
> +	return sp->role.cr0_wp && sp->role.smap_andnot_wp;
> +}
> +
>  /* @sp->gfn should be write-protected at the call site */
>  static bool __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  			    struct list_head *invalid_list)
>  {
> -	if (sp->role.cr4_pae != !!is_pae(vcpu)
> -	    || vcpu->arch.mmu->sync_page(vcpu, sp) == 0) {
> +	if ((!is_ept_sp(sp) && sp->role.gpte_size != !!is_pae(vcpu)) ||
> +	    vcpu->arch.mmu->sync_page(vcpu, sp) == 0) {
>  		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
>  		return false;
>  	}
> @@ -2423,7 +2429,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  	role.level = level;
>  	role.direct = direct;
>  	if (role.direct)
> -		role.cr4_pae = 0;
> +		role.gpte_size = true;
>  	role.access = access;
>  	if (!vcpu->arch.mmu->direct_map
>  	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
> @@ -4793,7 +4799,6 @@ static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu,
>  
>  	role.base.access = ACC_ALL;
>  	role.base.nxe = !!is_nx(vcpu);
> -	role.base.cr4_pae = !!is_pae(vcpu);
>  	role.base.cr0_wp = is_write_protection(vcpu);
>  	role.base.smm = is_smm(vcpu);
>  	role.base.guest_mode = is_guest_mode(vcpu);
> @@ -4814,6 +4819,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
>  	role.base.ad_disabled = (shadow_accessed_mask == 0);
>  	role.base.level = kvm_x86_ops->get_tdp_level(vcpu);
>  	role.base.direct = true;
> +	role.base.gpte_size = true;
>  
>  	return role;
>  }
> @@ -4878,6 +4884,7 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
>  	role.base.smap_andnot_wp = role.ext.cr4_smap &&
>  		!is_write_protection(vcpu);
>  	role.base.direct = !is_paging(vcpu);
> +	role.base.gpte_size = !!is_pae(vcpu);
>  
>  	if (!is_long_mode(vcpu))
>  		role.base.level = PT32E_ROOT_LEVEL;
> @@ -4918,21 +4925,24 @@ kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
>  				   bool execonly)
>  {
>  	union kvm_mmu_role role = {0};
> -	union kvm_mmu_page_role root_base = vcpu->arch.root_mmu.mmu_role.base;
>  
> -	/* Legacy paging and SMM flags are inherited from root_mmu */
> -	role.base.smm = root_base.smm;
> -	role.base.nxe = root_base.nxe;
> -	role.base.cr0_wp = root_base.cr0_wp;
> -	role.base.smep_andnot_wp = root_base.smep_andnot_wp;
> -	role.base.smap_andnot_wp = root_base.smap_andnot_wp;
> +	/* SMM flag is inherited from root_mmu */
> +	role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm;
>  
>  	role.base.level = PT64_ROOT_4LEVEL;
> +	role.base.gpte_size = true;
>  	role.base.direct = false;
>  	role.base.ad_disabled = !accessed_dirty;
>  	role.base.guest_mode = true;
>  	role.base.access = ACC_ALL;
>  
> +	/*
> +	 * WP=1 and NOT_WP=1 is an impossible combination, use WP and the
> +	 * SMAP variation to denote shadow EPT entries.
> +	 */
> +	role.base.cr0_wp = true;
> +	role.base.smap_andnot_wp = true;
> +
>  	role.ext = kvm_calc_mmu_role_ext(vcpu);
>  	role.ext.execonly = execonly;
>  
> @@ -5183,7 +5193,7 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
>  		 gpa, bytes, sp->role.word);
>  
>  	offset = offset_in_page(gpa);
> -	pte_size = sp->role.cr4_pae ? 8 : 4;
> +	pte_size = sp->role.gpte_size ? 8 : 4;
>  
>  	/*
>  	 * Sometimes, the OS only writes the last one bytes to update status
> @@ -5207,7 +5217,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
>  	page_offset = offset_in_page(gpa);
>  	level = sp->role.level;
>  	*nspte = 1;
> -	if (!sp->role.cr4_pae) {
> +	if (!sp->role.gpte_size) {
>  		page_offset <<= 1;	/* 32->64 */
>  		/*
>  		 * A 32-bit pde maps 4MB while the shadow pdes map
> diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
> index 9f6c855a0043..96f27b654e6f 100644
> --- a/arch/x86/kvm/mmutrace.h
> +++ b/arch/x86/kvm/mmutrace.h
> @@ -29,10 +29,10 @@
>  								        \
>  	role.word = __entry->role;					\
>  									\
> -	trace_seq_printf(p, "sp gfn %llx l%u%s q%u%s %s%s"		\
> +	trace_seq_printf(p, "sp gfn %llx l%u %u-byte q%u%s %s%s"	\
>  			 " %snxe %sad root %u %s%c",			\
>  			 __entry->gfn, role.level,			\
> -			 role.cr4_pae ? " pae" : "",			\
> +			 role.gpte_size ? 8 : 4,			\
>  			 role.quadrant,					\
>  			 role.direct ? " direct" : "",			\
>  			 access_str[role.access],			\

This should work, thanks! I'm going to run some tests and report the result.
Sean Christopherson March 11, 2019, 2:01 p.m. UTC | #2
On Mon, Mar 11, 2019 at 12:56:33PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > [...]
> > -		unsigned cr4_pae:1;
> > +		unsigned gpte_size:1;
> 
> This is probably just a matter of taste but I would prefer it to be
> renamed to something like "gpte_is_64": when I see
> 
> role.gpte_size = true
> 
> I have to go and read the definition of gpte_size, whereas
> 
> role.gpte_is_64 = true
> 
> reads more naturally. 

Agreed.  What about gpte_is_8_bytes to avoid being misconstrued as "this
gpte is for paging in 64-bit mode"?

Vitaly Kuznetsov March 11, 2019, 2:11 p.m. UTC | #3
Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Mon, Mar 11, 2019 at 12:56:33PM +0100, Vitaly Kuznetsov wrote:
>> Sean Christopherson <sean.j.christopherson@intel.com> writes:
>> 
>> > +		unsigned gpte_size:1;
>> 
>> This is probably just a matter of taste but I would prefer it to be
>> renamed to something like "gpte_is_64": when I see
>> 
>> role.gpte_size = true
>> 
>> I have to go and read the definition of gpte_size, whereas
>> 
>> role.gpte_is_64 = true
>> 
>> reads more naturally. 
>
> Agreed.  What about gpte_is_8_bytes to avoid being misconstrued as "this
> gpte is for paging in 64-bit mode"?
>

Even better!
Vitaly Kuznetsov March 11, 2019, 4:18 p.m. UTC | #4
Vitaly Kuznetsov <vkuznets@redhat.com> writes:

>
> This should work, thanks! I'm going to run some tests and report the result.

FWIW, I ran some tests on both VMX and SVM with and without EPT/NPT and
nothing obvious seems to break; the original issue with
fast_cr3_switch() not working as expected is also gone. So

Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>

for the series.
Paolo Bonzini March 15, 2019, 5:53 p.m. UTC | #5
On 11/03/19 17:18, Vitaly Kuznetsov wrote:
> Vitaly Kuznetsov <vkuznets@redhat.com> writes:
> 
>>
>> This should work, thanks! I'm going to run some tests and report the result.
> 
> FWIW, I ran some tests on both VMX and SVM with and without EPT/NPT and
> nothing obvious seems to break; the original issue with
> fast_cr3_switch() not working as expected is also gone. So
> 
> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> 
> for the series.
> 

Queued for after the merge window with s/gpte_size/gpte_is_8_bytes, thanks.

Paolo

Patch

diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index f365102c80f5..2747ad774a38 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -142,7 +142,7 @@  Shadow pages contain the following information:
     If clear, this page corresponds to a guest page table denoted by the gfn
     field.
   role.quadrant:
-    When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
+    When role.gpte_size=0, the guest uses 32-bit gptes while the host uses 64-bit
     sptes.  That means a guest page table contains more ptes than the host,
     so multiple shadow pages are needed to shadow one guest page.
     For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
@@ -158,9 +158,9 @@  Shadow pages contain the following information:
     The page is invalid and should not be used.  It is a root page that is
     currently pinned (by a cpu hardware register pointing to it); once it is
     unpinned it will be destroyed.
-  role.cr4_pae:
-    Contains the value of cr4.pae for which the page is valid (e.g. whether
-    32-bit or 64-bit gptes are in use).
+  role.gpte_size:
+    Reflects the size of the guest PTE for which the page is valid, i.e. '1'
+    if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
   role.nxe:
     Contains the value of efer.nxe for which the page is valid.
   role.cr0_wp:
@@ -173,6 +173,9 @@  Shadow pages contain the following information:
     Contains the value of cr4.smap && !cr0.wp for which the page is valid
     (pages for which this is true are different from other pages; see the
     treatment of cr0.wp=0 below).
+  role.ept_sp:
+    This is a virtual flag to denote a shadowed nested EPT page.  ept_sp
+    is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
   role.smm:
     Is 1 if the page is valid in system management mode.  This field
     determines which of the kvm_memslots array was used to build this
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7712b4ed8aa1..fdf67d33f6b3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -253,14 +253,14 @@  struct kvm_mmu_memory_cache {
  * kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used
  * by indirect shadow page can not be more than 15 bits.
  *
- * Currently, we used 14 bits that are @level, @cr4_pae, @quadrant, @access,
+ * Currently, we used 14 bits that are @level, @gpte_size, @quadrant, @access,
  * @nxe, @cr0_wp, @smep_andnot_wp and @smap_andnot_wp.
  */
 union kvm_mmu_page_role {
 	u32 word;
 	struct {
 		unsigned level:4;
-		unsigned cr4_pae:1;
+		unsigned gpte_size:1;
 		unsigned quadrant:2;
 		unsigned direct:1;
 		unsigned access:3;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 71098f2ae870..2a32a14efd98 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -182,7 +182,7 @@  struct kvm_shadow_walk_iterator {
 
 static const union kvm_mmu_page_role mmu_base_role_mask = {
 	.cr0_wp = 1,
-	.cr4_pae = 1,
+	.gpte_size = 1,
 	.nxe = 1,
 	.smep_andnot_wp = 1,
 	.smap_andnot_wp = 1,
@@ -2205,6 +2205,7 @@  static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 				    struct list_head *invalid_list);
 
+
 #define for_each_valid_sp(_kvm, _sp, _gfn)				\
 	hlist_for_each_entry(_sp,					\
 	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
@@ -2215,12 +2216,17 @@  static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	for_each_valid_sp(_kvm, _sp, _gfn)				\
 		if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
 
+static inline bool is_ept_sp(struct kvm_mmu_page *sp)
+{
+	return sp->role.cr0_wp && sp->role.smap_andnot_wp;
+}
+
 /* @sp->gfn should be write-protected at the call site */
 static bool __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			    struct list_head *invalid_list)
 {
-	if (sp->role.cr4_pae != !!is_pae(vcpu)
-	    || vcpu->arch.mmu->sync_page(vcpu, sp) == 0) {
+	if ((!is_ept_sp(sp) && sp->role.gpte_size != !!is_pae(vcpu)) ||
+	    vcpu->arch.mmu->sync_page(vcpu, sp) == 0) {
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
 		return false;
 	}
@@ -2423,7 +2429,7 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	role.level = level;
 	role.direct = direct;
 	if (role.direct)
-		role.cr4_pae = 0;
+		role.gpte_size = true;
 	role.access = access;
 	if (!vcpu->arch.mmu->direct_map
 	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
@@ -4793,7 +4799,6 @@  static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu,
 
 	role.base.access = ACC_ALL;
 	role.base.nxe = !!is_nx(vcpu);
-	role.base.cr4_pae = !!is_pae(vcpu);
 	role.base.cr0_wp = is_write_protection(vcpu);
 	role.base.smm = is_smm(vcpu);
 	role.base.guest_mode = is_guest_mode(vcpu);
@@ -4814,6 +4819,7 @@  kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
 	role.base.ad_disabled = (shadow_accessed_mask == 0);
 	role.base.level = kvm_x86_ops->get_tdp_level(vcpu);
 	role.base.direct = true;
+	role.base.gpte_size = true;
 
 	return role;
 }
@@ -4878,6 +4884,7 @@  kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
 	role.base.smap_andnot_wp = role.ext.cr4_smap &&
 		!is_write_protection(vcpu);
 	role.base.direct = !is_paging(vcpu);
+	role.base.gpte_size = !!is_pae(vcpu);
 
 	if (!is_long_mode(vcpu))
 		role.base.level = PT32E_ROOT_LEVEL;
@@ -4918,21 +4925,24 @@  kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
 				   bool execonly)
 {
 	union kvm_mmu_role role = {0};
-	union kvm_mmu_page_role root_base = vcpu->arch.root_mmu.mmu_role.base;
 
-	/* Legacy paging and SMM flags are inherited from root_mmu */
-	role.base.smm = root_base.smm;
-	role.base.nxe = root_base.nxe;
-	role.base.cr0_wp = root_base.cr0_wp;
-	role.base.smep_andnot_wp = root_base.smep_andnot_wp;
-	role.base.smap_andnot_wp = root_base.smap_andnot_wp;
+	/* SMM flag is inherited from root_mmu */
+	role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm;
 
 	role.base.level = PT64_ROOT_4LEVEL;
+	role.base.gpte_size = true;
 	role.base.direct = false;
 	role.base.ad_disabled = !accessed_dirty;
 	role.base.guest_mode = true;
 	role.base.access = ACC_ALL;
 
+	/*
+	 * WP=1 and NOT_WP=1 is an impossible combination, use WP and the
+	 * SMAP variation to denote shadow EPT entries.
+	 */
+	role.base.cr0_wp = true;
+	role.base.smap_andnot_wp = true;
+
 	role.ext = kvm_calc_mmu_role_ext(vcpu);
 	role.ext.execonly = execonly;
 
@@ -5183,7 +5193,7 @@  static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
 		 gpa, bytes, sp->role.word);
 
 	offset = offset_in_page(gpa);
-	pte_size = sp->role.cr4_pae ? 8 : 4;
+	pte_size = sp->role.gpte_size ? 8 : 4;
 
 	/*
 	 * Sometimes, the OS only writes the last one bytes to update status
@@ -5207,7 +5217,7 @@  static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
 	page_offset = offset_in_page(gpa);
 	level = sp->role.level;
 	*nspte = 1;
-	if (!sp->role.cr4_pae) {
+	if (!sp->role.gpte_size) {
 		page_offset <<= 1;	/* 32->64 */
 		/*
 		 * A 32-bit pde maps 4MB while the shadow pdes map
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 9f6c855a0043..96f27b654e6f 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -29,10 +29,10 @@ 
 								        \
 	role.word = __entry->role;					\
 									\
-	trace_seq_printf(p, "sp gfn %llx l%u%s q%u%s %s%s"		\
+	trace_seq_printf(p, "sp gfn %llx l%u %u-byte q%u%s %s%s"	\
 			 " %snxe %sad root %u %s%c",			\
 			 __entry->gfn, role.level,			\
-			 role.cr4_pae ? " pae" : "",			\
+			 role.gpte_size ? 8 : 4,			\
 			 role.quadrant,					\
 			 role.direct ? " direct" : "",			\
 			 access_str[role.access],			\