
[RFC,5/6] KVM: X86: Alloc pae_root shadow page

Message ID 20211210092508.7185-6-jiangshanlai@gmail.com (mailing list archive)
State New, archived
Series KVM: X86: Add and use shadow page with level promoted or acting as pae_root

Commit Message

Lai Jiangshan Dec. 10, 2021, 9:25 a.m. UTC
From: Lai Jiangshan <laijs@linux.alibaba.com>

Currently, pae_root is a special root page.  This patch adds the facility to
allow using kvm_mmu_get_page() to allocate a pae_root shadow page.

When kvm_mmu_get_page() is called with level == PT32E_ROOT_LEVEL and
vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, it will return a
DMA32 root page with the default PAE PDPTEs installed.

The pae_root bit is needed in the page role because:
	it is required to be a DMA32 page.
	its first 4 sptes are initialized with default_pae_pdpte.

default_pae_pdpte is needed because the CPU expects PAE PDPTEs to be
present at VM-entry.  default_pae_pdpte is designed to have no
SPTE_MMU_PRESENT_MASK, so that it is present in the view of the CPU but not
present in the view of shadow paging, and the page fault handler will
replace it with a real present shadow page.

When changing from default_pae_pdpte to a present spte, no TLB flush
is requested, although both are present in the view of the CPU.  The reason
is that default_pae_pdpte points to a zero page, so no PTE is present even
if the paging structure is cached.

No functionality is changed, since this code is not yet activated: when
vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, kvm_mmu_get_page()
is currently only called for level == 1 or 2.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/mmu/mmu.c          | 113 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |   1 +
 3 files changed, 114 insertions(+), 4 deletions(-)

Comments

Sean Christopherson Jan. 4, 2022, 9:54 p.m. UTC | #1
On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Currently pae_root is special root page, this patch adds facility to
> allow using kvm_mmu_get_page() to allocate pae_root shadow page.
> 
> When kvm_mmu_get_page() is called for level == PT32E_ROOT_LEVEL and
> vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, it will get a
> DMA32 root page with default PAE pdptes installed.
> 
> The pae_root bit is needed in the page role because:
> 	it is required to be DMA32 page.

Please try to avoid ambiguous pronouns; "it" here does not refer to any of the
nouns that were established in the initial sentence, e.g.

        PAE roots must be allocated below 4gb (CR3 has only 32 bits).

> 	its first 4 sptes are initialized with default_pae_pdpte.
> 
> default_pae_pdpte is needed because the cpu expect PAE pdptes are
> present when VMenter.

That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
is perfectly ok, present with reserved bits is what's not allowed.

Intel SDM:
  A VM entry that checks the validity of the PDPTEs uses the same checks that are
  used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
  would cause a general-protection exception due to the PDPTEs that would be loaded
  (e.g., because a reserved bit is set), the VM entry fails.
  
  7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
     (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.

AMD APM:
  On current AMD64 processors, in native mode, these four entries are unconditionally
  loaded into the table walk cache whenever CR3 is written with the PDPT base address,
  and remain locked in. At this point they are also checked for reserved bit violations,
  and if such violations are present a general-protection exception (#GP) occurs.

  Under SVM, however, when the processor is in guest mode with PAE enabled, the
  guest PDPT entries are not cached or validated at this point, but instead are loaded
  and checked on demand in the normal course of address translation, just like page
  directory and page table entries. Any reserved bit violations are detected at the point
  of use, and result in a page-fault (#PF) exception rather than a Page Translation and
  Protection general-protection (#GP) exception.

> default_pae_pdpte is designed to have no
> SPTE_MMU_PRESENT_MASK so that it is present in the view of CPU but not
> present in the view of shadow papging, and the page fault handler will
> replace it with real present shadow page.

> 
> When changing from default_pae_pdpte to a present spte, no tlb flushing
> is requested, although both are present in the view of CPU.  The reason
> is that default_pae_pdpte points to zero page, no pte is present if the
> paging structure is cached.
> 
> No functionality changed since this code is not activated because when
> vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, kvm_mmu_get_page()
> is only called for level == 1 or 2 now.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   4 +-
>  arch/x86/kvm/mmu/mmu.c          | 113 +++++++++++++++++++++++++++++++-
>  arch/x86/kvm/mmu/paging_tmpl.h  |   1 +
>  3 files changed, 114 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6465c83794fc..82a8844f80ac 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -335,7 +335,8 @@ union kvm_mmu_page_role {
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
>  		unsigned level_promoted:1;
> -		unsigned :5;
> +		unsigned pae_root:1;
> +		unsigned :4;
>  
>  		/*
>  		 * This is left at the top of the word so that
> @@ -695,6 +696,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>  	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
> +	unsigned long mmu_pae_root_cache;

Hmm, I think it will be easier to understand exactly what this "cache" holds if
the cache is tracked as a pointer, either the page_address() "void *" or maybe
even the original "struct page".  A "void *" is probably the way to go; less code
than tracking "struct page" and kvm_mmu_memory_cache_alloc() also returns "void *".

>  	/*
>  	 * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4769253e9024..0d2976dad863 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -724,6 +724,67 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static u64 default_pae_pdpte;
> +
> +static void free_default_pae_pdpte(void)
> +{
> +	free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
> +	default_pae_pdpte = 0;
> +}
> +
> +static int alloc_default_pae_pdpte(void)
> +{
> +	unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> +
> +	if (!p)
> +		return -ENOMEM;
> +	default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
> +	if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
> +		    is_mmio_spte(default_pae_pdpte))) {
> +		free_default_pae_pdpte();
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static int alloc_pae_root(struct kvm_vcpu *vcpu)

I'd prefer we name this mmu_topup_pae_root_cache().  "alloc_pae_root" implies the
helper returns the allocated root.

> +{
> +	struct page *page;
> +	unsigned long pae_root;
> +	u64* pdpte;
> +
> +	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
> +		return 0;
> +	if (vcpu->arch.mmu_pae_root_cache)
> +		return 0;
> +
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	pae_root = (unsigned long)page_address(page);
> +
> +	/*
> +	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
> +	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
> +	 * that KVM's writes and the CPU's reads get along.  Note, this is
> +	 * only necessary when using shadow paging, as 64-bit NPT can get at
> +	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
> +	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
> +	 */
> +	if (!tdp_enabled)
> +		set_memory_decrypted(pae_root, 1);
> +	else
> +		WARN_ON_ONCE(shadow_me_mask);
> +	vcpu->arch.mmu_pae_root_cache = pae_root;
> +	pdpte = (void *)pae_root;
> +	pdpte[0] = default_pae_pdpte;
> +	pdpte[1] = default_pae_pdpte;
> +	pdpte[2] = default_pae_pdpte;
> +	pdpte[3] = default_pae_pdpte;
> +	return 0;

Assuming I'm not missing something regarding default_pae_pdpte, this can be:

	struct page *page;

	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
		return 0;
	if (vcpu->arch.mmu_pae_root_cache)
		return 0;

	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
	if (!page)
		return -ENOMEM;

	vcpu->arch.mmu_pae_root_cache = page_address(page);

	/*
	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
	 * that KVM's writes and the CPU's reads get along.  Note, this is
	 * only necessary when using shadow paging, as 64-bit NPT can get at
	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
	 */
	if (!tdp_enabled)
		set_memory_decrypted((unsigned long)vcpu->arch.mmu_pae_root_cache, 1);
	else
		WARN_ON_ONCE(shadow_me_mask);
	return 0;


> +}
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>  	int r;
> @@ -735,6 +796,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  		return r;
>  	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
>  				       PT64_ROOT_MAX_LEVEL);
> +	if (r)
> +		return r;
> +	r = alloc_pae_root(vcpu);
>  	if (r)
>  		return r;
>  	if (maybe_indirect) {
> @@ -753,6 +817,10 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> +	if (!tdp_enabled && vcpu->arch.mmu_pae_root_cache)
> +		set_memory_encrypted(vcpu->arch.mmu_pae_root_cache, 1);
> +	free_page(vcpu->arch.mmu_pae_root_cache);
> +	vcpu->arch.mmu_pae_root_cache = 0;

If this ends up being a void *, set it to NULL instead of 0.

>  }
>  
>  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
> @@ -1706,6 +1774,8 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
>  	hlist_del(&sp->hash_link);
>  	list_del(&sp->link);
> +	if (!tdp_enabled && sp->role.pae_root)
> +		set_memory_encrypted((unsigned long)sp->spt, 1);

Hmm, I wonder if it would be better to do:

	if (sp->role.pae_root)
		mmu_free_pae_root(sp->spt);
	else
		free_page((unsigned long)sp->spt);

and then share the helper with mmu_free_memory_caches(), e.g.

	if (vcpu->arch.mmu_pae_root_cache) {
		mmu_free_pae_root(vcpu->arch.mmu_pae_root_cache);
		vcpu->arch.mmu_pae_root_cache = NULL;
	}

This is perfectly ok, but I think it would be easy for someone to misread
mmu_free_memory_caches() and think that the free_page() there could be inside the
if statement.  Having this in a helper:

	if (!tdp_enabled)
		set_memory_encrypted((unsigned long)root, 1);
	free_page((unsigned long)root);

makes it a bit more obvious that re-encrypting memory is a !TDP specific action
that's on top of the "applies to everything" free_page().

>  	free_page((unsigned long)sp->spt);
>  	if (!sp->role.direct && !sp->role.level_promoted)
>  		free_page((unsigned long)sp->gfns);
> @@ -1735,8 +1805,13 @@ static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
>  static void drop_parent_pte(struct kvm_mmu_page *sp,
>  			    u64 *parent_pte)
>  {
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(parent_pte);
> +
>  	mmu_page_remove_parent_pte(sp, parent_pte);
> -	mmu_spte_clear_no_track(parent_pte);
> +	if (!parent_sp->role.pae_root)
> +		mmu_spte_clear_no_track(parent_pte);
> +	else
> +		__update_clear_spte_fast(parent_pte, default_pae_pdpte);
>  }
>  
>  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn, union kvm_mmu_page_role role)
> @@ -1744,7 +1819,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	struct kvm_mmu_page *sp;
>  
>  	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +	if (!role.pae_root) {
> +		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +	} else {
> +		sp->spt = (void *)vcpu->arch.mmu_pae_root_cache;
> +		vcpu->arch.mmu_pae_root_cache = 0;
> +	}
>  	if (!(role.direct || role.level_promoted))
>  		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> @@ -2091,6 +2171,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  	}
>  	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
>  		role.level_promoted = 0;
> +	if (role.pae_root && (level < PT32E_ROOT_LEVEL))

Hmm, for both these checks, I think it's easier to read if the first test is
dropped, i.e.

	if (level <= vcpu->arch.mmu->root_level)
		role.level_promoted = 0;
	if (level != PT32E_ROOT_LEVEL)
		role.pae_root = 0;
		
> +		role.pae_root = 0;
>  
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -2226,14 +2308,27 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
>  	__shadow_walk_next(iterator, *iterator->sptep);
>  }
>  
> +static u64 make_pae_pdpte(u64 *child_pt)
> +{
> +	u64 spte = __pa(child_pt) | PT_PRESENT_MASK;
> +
> +	/* The only ignore bits in PDPTE are 11:9. */

Yeah, stupid PAE paging. :-/

> +	BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
> +	return spte | SPTE_MMU_PRESENT_MASK | shadow_me_mask;

I'd say drop "spte" and just do:

	return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
	       shadow_me_mask;
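
Folding that in, the whole helper would collapse to (sketch):

	static u64 make_pae_pdpte(u64 *child_pt)
	{
		/* The only ignored bits in a PDPTE are 11:9. */
		BUILD_BUG_ON(!(GENMASK(11, 9) & SPTE_MMU_PRESENT_MASK));

		return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
		       shadow_me_mask;
	}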

> +}
> +
>  static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>  			     struct kvm_mmu_page *sp)
>  {
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
>  	u64 spte;
>  
>  	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
>  
> -	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	if (!parent_sp->role.pae_root)
> +		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	else
> +		spte = make_pae_pdpte(sp->spt);
>  
>  	mmu_spte_set(sptep, spte);
>  
> @@ -4733,6 +4828,8 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
>  	role.base.direct = true;
>  	role.base.has_4_byte_gpte = false;
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
>  
>  	return role;
>  }
> @@ -4798,6 +4895,9 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	else
>  		role.base.level = PT64_ROOT_4LEVEL;
>  
> +	if (!____is_cr0_pg(regs) || !____is_efer_lma(regs))

The CR0.PG check is redundant as EFER.LMA cannot be '1' if CR0.PG=0.  It's confusing
because the role.base.level calculation above omits the CR0.PG check.  Speaking of
which, I think this would be clearer:

	if (role.base.level == PT32E_ROOT_LEVEL)
		role.base.pae_root = 1;

Alternatively, it could be folded into the level calculation, e.g.

	if (!____is_efer_lma(regs)) {
		role.base.level = PT32E_ROOT_LEVEL;
		role.base.pae_root = 1;
		if (____is_cr0_pg(regs) && !____is_cr4_pse(regs))
			role.base.level_promoted = 1;
	} else if (____is_cr4_la57(regs)) {
		role.base.level = PT64_ROOT_5LEVEL;
	} else {
		role.base.level = PT64_ROOT_4LEVEL;
	}

though I think I prefer to keep that purely a level calculation.

> +		role.base.pae_root = 1;
> +
>  	return role;
>  }
>  
> @@ -4845,6 +4945,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
>  	if (role.base.level > role_regs_to_root_level(regs))
>  		role.base.level_promoted = 1;
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;

I almost wonder if we'd be better off having a "root" flag, and then a helper:

  static bool is_pae_root(union kvm_mmu_page_role role)
  {
  	return role.root && role.level == PT32E_ROOT_LEVEL;
  }

I believe the only collateral damage would be that it'd prevent reusing a 4-level
root in a 5-level paging structure, but that's not necessarily a bad thing.
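
For illustration, a sketch of how the generic flag might be used (the "root"
field and the call site below are hypothetical, and assume mmu_pae_root_cache
becomes a void * as suggested earlier):

	/* in kvm_mmu_page_role (hypothetical) */
	unsigned root:1;

	static bool is_pae_root(union kvm_mmu_page_role role)
	{
		return role.root && role.level == PT32E_ROOT_LEVEL;
	}

	/* e.g. in kvm_mmu_alloc_page() */
	if (is_pae_root(role)) {
		sp->spt = vcpu->arch.mmu_pae_root_cache;
		vcpu->arch.mmu_pae_root_cache = NULL;
	} else {
		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
	}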

>  
>  	return role;
>  }
> @@ -6133,6 +6235,10 @@ int kvm_mmu_module_init(void)
>  	if (ret)
>  		goto out;
>  
> +	ret = alloc_default_pae_pdpte();
> +	if (ret)
> +		goto out;
> +
>  	return 0;
>  
>  out:
> @@ -6174,6 +6280,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>  
>  void kvm_mmu_module_exit(void)
>  {
> +	free_default_pae_pdpte();
>  	mmu_destroy_caches();
>  	percpu_counter_destroy(&kvm_total_used_mmu_pages);
>  	unregister_shrinker(&mmu_shrinker);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 16ac276d342a..014136e15b26 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1044,6 +1044,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		.access = 0x7,
>  		.quadrant = 0x3,
>  		.level_promoted = 0x1,
> +		.pae_root = 0x1,
>  	};
>  
>  	/*
> -- 
> 2.19.1.6.gb485710b
>
Lai Jiangshan Jan. 5, 2022, 3:11 a.m. UTC | #2
On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:

> >
> > default_pae_pdpte is needed because the cpu expect PAE pdptes are
> > present when VMenter.
>
> That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
> is perfectly ok, present with reserved bits is what's not allowed.
>
> Intel SDM:
>   A VM entry that checks the validity of the PDPTEs uses the same checks that are
>   used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
>   would cause a general-protection exception due to the PDPTEs that would be loaded
>   (e.g., because a reserved bit is set), the VM entry fails.
>
>   7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
>      (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.

But in practice, the VM entry fails if the present bit is not set in the
PDPTE for the linear address being accessed (at least when EPT is enabled).
The host kvm complains and dumps the vmcs state.

Setting a default pdpte is the simplest way to solve it.
Sean Christopherson Jan. 5, 2022, 4:45 p.m. UTC | #3
On Wed, Jan 05, 2022, Lai Jiangshan wrote:
> On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
> 
> > >
> > > default_pae_pdpte is needed because the cpu expect PAE pdptes are
> > > present when VMenter.
> >
> > That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
> > is perfectly ok, present with reserved bits is what's not allowed.
> >
> > Intel SDM:
> >   A VM entry that checks the validity of the PDPTEs uses the same checks that are
> >   used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
> >   would cause a general-protection exception due to the PDPTEs that would be loaded
> >   (e.g., because a reserved bit is set), the VM entry fails.
> >
> >   7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
> >      (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
> 
> But in practice, the VM entry fails if the present bit is not set in the
> PDPTE for the linear address being accessed (when EPT enabled at least).  The
> host kvm complains and dumps the vmcs state.

That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
shadow page tables.

And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
shadow page tables for not-present PDPTEs.

If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
debug.
Lai Jiangshan Jan. 6, 2022, 2:01 a.m. UTC | #4
On 2022/1/6 00:45, Sean Christopherson wrote:
> On Wed, Jan 05, 2022, Lai Jiangshan wrote:
>> On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
>>
>>>>
>>>> default_pae_pdpte is needed because the cpu expect PAE pdptes are
>>>> present when VMenter.
>>>
>>> That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
>>> is perfectly ok, present with reserved bits is what's not allowed.
>>>
>>> Intel SDM:
>>>    A VM entry that checks the validity of the PDPTEs uses the same checks that are
>>>    used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
>>>    would cause a general-protection exception due to the PDPTEs that would be loaded
>>>    (e.g., because a reserved bit is set), the VM entry fails.
>>>
>>>    7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
>>>       (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
>>
>> But in practice, the VM entry fails if the present bit is not set in the
>> PDPTE for the linear address being accessed (when EPT enabled at least).  The
>> host kvm complains and dumps the vmcs state.
> 
> That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
> The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
> shadow page tables.

Oh, I wrote the opposite of what I wanted to say, again; it happens when I try
to emphasize something after writing a sentence and modifying it several times.

I meant "EPT not enabled" on VMX.

The VM entry fails when the guest is in a very early stage of booting, when it
might still be in real mode.

VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
reason=80000021 qualification=0000000000000002

IDTVectoring: info=00000000 errorcode=00000000

> 
> And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
> shadow page tables for not-present PDPTEs.
> 
> If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
> debug.

If you can try this patchset, you can just set the default_pae_pdpte to 0 to test
it.

If you can't try this patchset, it is also possible to modify mmu->pae_root
directly to test it.

I guess VMX fails to translate %rip at VM-entry in this case.
Sean Christopherson Jan. 6, 2022, 7:41 p.m. UTC | #5
On Thu, Jan 06, 2022, Lai Jiangshan wrote:
> 
> 
> On 2022/1/6 00:45, Sean Christopherson wrote:
> > On Wed, Jan 05, 2022, Lai Jiangshan wrote:
> > > On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > > > 
> > > > > default_pae_pdpte is needed because the cpu expect PAE pdptes are
> > > > > present when VMenter.
> > > > 
> > > > That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
> > > > is perfectly ok, present with reserved bits is what's not allowed.
> > > > 
> > > > Intel SDM:
> > > >    A VM entry that checks the validity of the PDPTEs uses the same checks that are
> > > >    used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
> > > >    would cause a general-protection exception due to the PDPTEs that would be loaded
> > > >    (e.g., because a reserved bit is set), the VM entry fails.
> > > > 
> > > >    7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
> > > >       (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
> > > 
> > > But in practice, the VM entry fails if the present bit is not set in the
> > > PDPTE for the linear address being accessed (when EPT enabled at least).  The
> > > host kvm complains and dumps the vmcs state.
> > 
> > That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
> > The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
> > shadow page tables.
> 
> Oh, I wrote the negative what I want to say again when I try to emphasis
> something after I wrote a sentence and modified it several times.
> 
> I wanted to mean "EPT not enabled" when vmx.

Heh, that makes a lot more sense.

> The VM entry fails when the guest is in very early stage when booting which
> might be still in real mode.
> 
> VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
> reason=80000021 qualification=0000000000000002

Yep, that's the signature for an illegal PDPTE at VM-Enter.  But as noted above,
a not-present PDPTE is perfectly legal; VM-Enter should fail if and only if a
PDPTE is present and has reserved bits set.

> IDTVectoring: info=00000000 errorcode=00000000
> 
> > 
> > And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
> > shadow page tables for not-present PDPTEs.
> > 
> > If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
> > debug.
> 
> If you can try this patchset, you can just set the default_pae_pdpte to 0 to test
> it.

I can't reproduce the failure with this on top of your series + kvm/queue (commit
cc0e35f9c2d4 ("KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled")).

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f6f7caf76b70..b7170a840330 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -728,22 +728,11 @@ static u64 default_pae_pdpte;

 static void free_default_pae_pdpte(void)
 {
-       free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
        default_pae_pdpte = 0;
 }

 static int alloc_default_pae_pdpte(void)
 {
-       unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
-
-       if (!p)
-               return -ENOMEM;
-       default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
-       if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
-                   is_mmio_spte(default_pae_pdpte))) {
-               free_default_pae_pdpte();
-               return -EINVAL;
-       }
        return 0;
 }

Are you using a different base and/or running with other changes?

To aid debug, the below patch will dump the PDPTEs from the current MMU root on
failure (I'll also submit this as a formal patch).  On failure, I would expect
that at least one of the PDPTEs will be present with reserved bits set.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe06b02994e6..c13f37ef1bbc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5773,11 +5773,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
        pr_err("CR4: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n",
               cr4, vmcs_readl(CR4_READ_SHADOW), vmcs_readl(CR4_GUEST_HOST_MASK));
        pr_err("CR3 = 0x%016lx\n", vmcs_readl(GUEST_CR3));
-       if (cpu_has_vmx_ept()) {
+       if (enable_ept) {
                pr_err("PDPTR0 = 0x%016llx  PDPTR1 = 0x%016llx\n",
                       vmcs_read64(GUEST_PDPTR0), vmcs_read64(GUEST_PDPTR1));
                pr_err("PDPTR2 = 0x%016llx  PDPTR3 = 0x%016llx\n",
                       vmcs_read64(GUEST_PDPTR2), vmcs_read64(GUEST_PDPTR3));
+       } else if (vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL &&
+                  VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
+               u64 *pdpte = __va(vcpu->arch.mmu->root_hpa);
+
+               pr_err("PDPTE0 = 0x%016llx  PDPTE1 = 0x%016llx\n",
+                      pdpte[0], pdpte[1]);
+               pr_err("PDPTE2 = 0x%016llx  PDPTE3 = 0x%016llx\n",
+                      pdpte[2], pdpte[3]);
        }
        pr_err("RSP = 0x%016lx  RIP = 0x%016lx\n",
               vmcs_readl(GUEST_RSP), vmcs_readl(GUEST_RIP));

> If you can't try this patchset, the mmu->pae_root can be possible to be modified
> to test it.
> 
> I guess the vmx fails to translate %rip when VMentry in this case.

No, the CPU doesn't translate RIP at VM-Enter; vmcs.GUEST_RIP is only checked for
legality, e.g. that it's canonical.  Translating RIP through page tables is firmly
a post-VM-Enter code fetch action.
Lai Jiangshan Jan. 7, 2022, 4:36 a.m. UTC | #6
On 2022/1/7 03:41, Sean Christopherson wrote:
> On Thu, Jan 06, 2022, Lai Jiangshan wrote:
>>
>>
>> On 2022/1/6 00:45, Sean Christopherson wrote:
>>> On Wed, Jan 05, 2022, Lai Jiangshan wrote:
>>>> On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
>>>>
>>>>>>
>>>>>> default_pae_pdpte is needed because the cpu expect PAE pdptes are
>>>>>> present when VMenter.
>>>>>
>>>>> That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
>>>>> is perfectly ok, present with reserved bits is what's not allowed.
>>>>>
>>>>> Intel SDM:
>>>>>     A VM entry that checks the validity of the PDPTEs uses the same checks that are
>>>>>     used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
>>>>>     would cause a general-protection exception due to the PDPTEs that would be loaded
>>>>>     (e.g., because a reserved bit is set), the VM entry fails.
>>>>>
>>>>>     7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
>>>>>        (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
>>>>
>>>> But in practice, the VM entry fails if the present bit is not set in the
>>>> PDPTE for the linear address being accessed (when EPT enabled at least).  The
>>>> host kvm complains and dumps the vmcs state.
>>>
>>> That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
>>> The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
>>> shadow page tables.
>>
>> Oh, I wrote the negative what I want to say again when I try to emphasis
>> something after I wrote a sentence and modified it several times.
>>
>> I wanted to mean "EPT not enabled" when vmx.
> 
> Heh, that makes a lot more sense.
> 
>> The VM entry fails when the guest is in very early stage when booting which
>> might be still in real mode.
>>
>> VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
>> reason=80000021 qualification=0000000000000002
> 
> Yep, that's the signature for an illegal PDPTE at VM-Enter.  But as noted above,
> a not-present PDPTE is perfectly legal, VM-Enter should failed if and only if a
> PDPTE is present and has reserved bits set.
> 
>> IDTVectoring: info=00000000 errorcode=00000000
>>
>>>
>>> And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
>>> shadow page tables for not-present PDPTEs.
>>>
>>> If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
>>> debug.
>>
>> If you can try this patchset, you can just set the default_pae_pdpte to 0 to test
>> it.
> 
> I can't reproduce the failure with this on top of your series + kvm/queue (commit
> cc0e35f9c2d4 ("KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled")).
> 


I can't reproduce the failure with this code base either.  And I can't reproduce
the failure when I switch back to the code base on which I developed it.

After reviewing all the logs I saved at that time, I think it was fixed once
make_pae_pdpte() was added.  I should have added make_pae_pdpte() first, before
adding default_pae_pdpte.  (The code was still a mess and the guest couldn't
fully function even when make_pae_pdpte() was added at that time.)

Removing default_pae_pdpte will simplify the code. Thank you.
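
For example, drop_parent_pte() could then go back to the current unconditional
clear (an untested sketch of one such simplification):

	static void drop_parent_pte(struct kvm_mmu_page *sp,
				    u64 *parent_pte)
	{
		mmu_page_remove_parent_pte(sp, parent_pte);
		mmu_spte_clear_no_track(parent_pte);
	}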

Thanks
Lai.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6465c83794fc..82a8844f80ac 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -335,7 +335,8 @@  union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned level_promoted:1;
-		unsigned :5;
+		unsigned pae_root:1;
+		unsigned :4;
 
 		/*
 		 * This is left at the top of the word so that
@@ -695,6 +696,7 @@  struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	unsigned long mmu_pae_root_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4769253e9024..0d2976dad863 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -724,6 +724,67 @@  static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+static u64 default_pae_pdpte;
+
+static void free_default_pae_pdpte(void)
+{
+	free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
+	default_pae_pdpte = 0;
+}
+
+static int alloc_default_pae_pdpte(void)
+{
+	unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
+
+	if (!p)
+		return -ENOMEM;
+	default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
+	if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
+		    is_mmio_spte(default_pae_pdpte))) {
+		free_default_pae_pdpte();
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int alloc_pae_root(struct kvm_vcpu *vcpu)
+{
+	struct page *page;
+	unsigned long pae_root;
+	u64* pdpte;
+
+	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
+		return 0;
+	if (vcpu->arch.mmu_pae_root_cache)
+		return 0;
+
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
+	if (!page)
+		return -ENOMEM;
+
+	pae_root = (unsigned long)page_address(page);
+
+	/*
+	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
+	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
+	 * that KVM's writes and the CPU's reads get along.  Note, this is
+	 * only necessary when using shadow paging, as 64-bit NPT can get at
+	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
+	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
+	 */
+	if (!tdp_enabled)
+		set_memory_decrypted(pae_root, 1);
+	else
+		WARN_ON_ONCE(shadow_me_mask);
+	vcpu->arch.mmu_pae_root_cache = pae_root;
+	pdpte = (void *)pae_root;
+	pdpte[0] = default_pae_pdpte;
+	pdpte[1] = default_pae_pdpte;
+	pdpte[2] = default_pae_pdpte;
+	pdpte[3] = default_pae_pdpte;
+	return 0;
+}
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -735,6 +796,9 @@  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 		return r;
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
+	if (r)
+		return r;
+	r = alloc_pae_root(vcpu);
 	if (r)
 		return r;
 	if (maybe_indirect) {
@@ -753,6 +817,10 @@  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
+	if (!tdp_enabled && vcpu->arch.mmu_pae_root_cache)
+		set_memory_encrypted(vcpu->arch.mmu_pae_root_cache, 1);
+	free_page(vcpu->arch.mmu_pae_root_cache);
+	vcpu->arch.mmu_pae_root_cache = 0;
 }
 
 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
@@ -1706,6 +1774,8 @@  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
+	if (!tdp_enabled && sp->role.pae_root)
+		set_memory_encrypted((unsigned long)sp->spt, 1);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct && !sp->role.level_promoted)
 		free_page((unsigned long)sp->gfns);
@@ -1735,8 +1805,13 @@  static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
 static void drop_parent_pte(struct kvm_mmu_page *sp,
 			    u64 *parent_pte)
 {
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(parent_pte);
+
 	mmu_page_remove_parent_pte(sp, parent_pte);
-	mmu_spte_clear_no_track(parent_pte);
+	if (!parent_sp->role.pae_root)
+		mmu_spte_clear_no_track(parent_pte);
+	else
+		__update_clear_spte_fast(parent_pte, default_pae_pdpte);
 }
 
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn, union kvm_mmu_page_role role)
@@ -1744,7 +1819,12 @@  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	if (!role.pae_root) {
+		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	} else {
+		sp->spt = (void *)vcpu->arch.mmu_pae_root_cache;
+		vcpu->arch.mmu_pae_root_cache = 0;
+	}
 	if (!(role.direct || role.level_promoted))
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
@@ -2091,6 +2171,8 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	}
 	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
 		role.level_promoted = 0;
+	if (role.pae_root && (level < PT32E_ROOT_LEVEL))
+		role.pae_root = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -2226,14 +2308,27 @@  static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
+static u64 make_pae_pdpte(u64 *child_pt)
+{
+	u64 spte = __pa(child_pt) | PT_PRESENT_MASK;
+
+	/* The only ignored bits in a PDPTE are 11:9. */
+	BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
+	return spte | SPTE_MMU_PRESENT_MASK | shadow_me_mask;
+}
+
 static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 			     struct kvm_mmu_page *sp)
 {
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
 	u64 spte;
 
 	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
 
-	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+	if (!parent_sp->role.pae_root)
+		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+	else
+		spte = make_pae_pdpte(sp->spt);
 
 	mmu_spte_set(sptep, spte);
 
@@ -4733,6 +4828,8 @@  kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
 	role.base.direct = true;
 	role.base.has_4_byte_gpte = false;
+	if (role.base.level == PT32E_ROOT_LEVEL)
+		role.base.pae_root = 1;
 
 	return role;
 }
@@ -4798,6 +4895,9 @@  kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	else
 		role.base.level = PT64_ROOT_4LEVEL;
 
+	if (!____is_cr0_pg(regs) || !____is_efer_lma(regs))
+		role.base.pae_root = 1;
+
 	return role;
 }
 
@@ -4845,6 +4945,8 @@  kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
 	if (role.base.level > role_regs_to_root_level(regs))
 		role.base.level_promoted = 1;
+	if (role.base.level == PT32E_ROOT_LEVEL)
+		role.base.pae_root = 1;
 
 	return role;
 }
@@ -6133,6 +6235,10 @@  int kvm_mmu_module_init(void)
 	if (ret)
 		goto out;
 
+	ret = alloc_default_pae_pdpte();
+	if (ret)
+		goto out;
+
 	return 0;
 
 out:
@@ -6174,6 +6280,7 @@  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 
 void kvm_mmu_module_exit(void)
 {
+	free_default_pae_pdpte();
 	mmu_destroy_caches();
 	percpu_counter_destroy(&kvm_total_used_mmu_pages);
 	unregister_shrinker(&mmu_shrinker);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 16ac276d342a..014136e15b26 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1044,6 +1044,7 @@  static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.access = 0x7,
 		.quadrant = 0x3,
 		.level_promoted = 0x1,
+		.pae_root = 0x1,
 	};
 
 	/*