[RFC,V3,3/4] KVM: X86: Alloc role.pae_root shadow page

Message ID	20220330132152.4568-4-jiangshanlai@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Lai Jiangshan <jiangshanlai@gmail.com> To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>, Sean Christopherson <seanjc@google.com> Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>, Jonathan Corbet <corbet@lwn.net>, Vitaly Kuznetsov <vkuznets@redhat.com>, Wanpeng Li <wanpengli@tencent.com>, Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, linux-doc@vger.kernel.org Subject: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page Date: Wed, 30 Mar 2022 21:21:51 +0800 Message-Id: <20220330132152.4568-4-jiangshanlai@gmail.com> In-Reply-To: <20220330132152.4568-1-jiangshanlai@gmail.com> References: <20220330132152.4568-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	KVM: X86: Add and use shadow page with level expanded or acting as pae_root \| expand [RFC,V3,0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root [RFC,V3,1/4] KVM: X86: Add arguement gfn and role to kvm_mmu_alloc_page() [RFC,V3,2/4] KVM: X86: Introduce role.glevel for level expanded pagetable [RFC,V3,3/4] KVM: X86: Alloc role.pae_root shadow page [RFC,V3,4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests

Lai Jiangshan March 30, 2022, 1:21 p.m. UTC

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Currently pae_root is special root page, this patch adds facility to
allow using kvm_mmu_get_page() to allocate pae_root shadow page.

When kvm_mmu_get_page() is called for role.level == PT32E_ROOT_LEVEL and
vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, it will get a
PAE root pagetable and set role.pae_root=1 for freeing.

The role.pae_root bit is needed in the page role because:
  o PAE roots must be allocated below 4gb (for kvm_mmu_get_page())
  o PAE roots can not be encrypted (for kvm_mmu_get_page())
  o Must be re-encrypted when freeing (for kvm_mmu_free_page())
  o PAE root's PDPTE is special (for link_shadow_page())
  o Not share the decrypted low-address pagetable with non-PAE-root
    ones or vice verse. (for kvm_mmu_get_page(), the crucial reason)

Both role.pae_root in link_shadow_page() and in kvm_mmu_get_page() can
be possible changed to use shadow_root_level and role.level instead.

But in kvm_mmu_free_page(), it can't use vcpu->arch.mmu->shadow_root_level.

PAE roots must be allocated below 4gb (CR3 has only 32 bits).  So a
cache is introduced (mmu_pae_root_cache).

No functionality changed since this code is not activated because when
vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, kvm_mmu_get_page()
is only called for level == 1 or 2 now.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 Documentation/virt/kvm/mmu.rst  |  2 +
 arch/x86/include/asm/kvm_host.h |  9 +++-
 arch/x86/kvm/mmu/mmu.c          | 78 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 4 files changed, 86 insertions(+), 4 deletions(-)

Sean Christopherson April 12, 2022, 9:14 p.m. UTC | #1

On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> Currently pae_root is special root page, this patch adds facility to
> allow using kvm_mmu_get_page() to allocate pae_root shadow page.

I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
gfn calculation will drop bits 11:5.

Handling this as a one-off is probably easier.  For TDP, only 32-bit KVM with NPT
benefits from reusing roots, IMO and shaving a few pages in that case is not worth
the complexity.

> @@ -332,7 +337,8 @@ union kvm_mmu_page_role {
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
>  		unsigned glevel:4;
> -		unsigned :2;
> +		unsigned pae_root:1;

If we do end up adding a role bit, it can simply be "root", which may or may not
be useful for other things.  is_pae_root() is then a combo of root+level.  This
will clean up the code a bit as role.root is (mostly?) hardcoded based on the
function, e.g. root allocators set it, child allocators clear it.

> +		unsigned :1;
>  
>  		/*
>  		 * This is left at the top of the word so that
> @@ -699,6 +705,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>  	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
> +	void *mmu_pae_root_cache;
>  
>  	/*
>  	 * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d53037df8177..81ccaa7c1165 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -694,6 +694,35 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static int mmu_topup_pae_root_cache(struct kvm_vcpu *vcpu)
> +{
> +	struct page *page;
> +
> +	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
> +		return 0;
> +	if (vcpu->arch.mmu_pae_root_cache)
> +		return 0;
> +
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
> +	if (!page)
> +		return -ENOMEM;
> +	vcpu->arch.mmu_pae_root_cache = page_address(page);
> +
> +	/*
> +	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
> +	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
> +	 * that KVM's writes and the CPU's reads get along.  Note, this is
> +	 * only necessary when using shadow paging, as 64-bit NPT can get at
> +	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
> +	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
> +	 */
> +	if (!tdp_enabled)
> +		set_memory_decrypted((unsigned long)vcpu->arch.mmu_pae_root_cache, 1);
> +	else
> +		WARN_ON_ONCE(shadow_me_mask);
> +	return 0;
> +}
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>  	int r;
> @@ -705,6 +734,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  		return r;
>  	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
>  				       PT64_ROOT_MAX_LEVEL);
> +	if (r)
> +		return r;
> +	r = mmu_topup_pae_root_cache(vcpu);

This doesn't need to be called from the common mmu_topup_memory_caches(), e.g. it
will unnecessarily require allocating another DMA32 page when handling a page fault.
I'd rather call this directly kvm_mmu_load(), which also makes it more obvious
that the cache really is only used for roots.

>  	if (r)
>  		return r;
>  	if (maybe_indirect) {
> @@ -717,12 +749,23 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  					  PT64_ROOT_MAX_LEVEL);
>  }
>  

...

>  static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>  			     struct kvm_mmu_page *sp)
>  {
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
>  	u64 spte;
>  
>  	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
>  
> -	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	if (!parent_sp->role.pae_root)

Hmm, without role.root, this could be:

	if (sp->role.level == (PT32E_ROOT_level - 1) &&
	    ((__pa(sptep) & PT64_BASE_ADDR_MASK) == vcpu->arch.mmu->root.hpa))
		spte = make_pae_pdpte(sp->spt);
	else
		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));

Which is gross, but it works.  We could also do FNAME(link_shadow_page) to send
PAE roots down a dedicated path (also gross).  Point being, I don't think we
strictly need a "root" flag unless the PAE roots are put in mmu_page_hash.

> +		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	else
> +		spte = make_pae_pdpte(sp->spt);
>  
>  	mmu_spte_set(sptep, spte);
>  
> @@ -4782,6 +4847,8 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
>  	role.base.direct = true;
>  	role.base.has_4_byte_gpte = false;
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
>  
>  	return role;
>  }
> @@ -4848,6 +4915,9 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	else
>  		role.base.level = PT64_ROOT_4LEVEL;
>  
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
> +
>  	return role;
>  }
>  
> @@ -4893,6 +4963,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>  
>  	role.base.direct = false;
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
>  
>  	return role;
>  }
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 67489a060eba..1015f33e0758 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		.access = 0x7,
>  		.quadrant = 0x3,
>  		.glevel = 0xf,
> +		.pae_root = 0x1,
>  	};
>  
>  	/*
> -- 
> 2.19.1.6.gb485710b
>

Lai Jiangshan April 14, 2022, 9:07 a.m. UTC | #2

On Wed, Apr 13, 2022 at 5:14 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > Currently pae_root is special root page, this patch adds facility to
> > allow using kvm_mmu_get_page() to allocate pae_root shadow page.
>
> I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
> for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
> reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
> gfn calculation will drop bits 11:5.

I forgot about it.

>
> Handling this as a one-off is probably easier.  For TDP, only 32-bit KVM with NPT
> benefits from reusing roots, IMO and shaving a few pages in that case is not worth
> the complexity.
>

I liked the one-off idea yesterday and started trying it.

But things were not going as smoothly as I thought.  There are too
many corner cases to cover.  Maybe I don't get what you envisioned.

one-off shadow pages must not be in the hash, must be freed
immediately in kvm_mmu_free_roots(), taken care in
kvm_mmu_prepare_zap_page() and so on.

When the guest is 32bit, the host has to free and allocate sp
every time when the guest changes cr3.  It will be a regression
when !TDP.

one-off shadow pages are too distinguished from others.

When using one-off shadow pages, role.passthough can be one
bit and be used only for 5-level NPT L0 for 4-level NPT L1,
which is neat.  And role.pae_root can be removed.

I want the newly added shadow pages to fit into the current
shadow page management and root management.

I'm going to add sp->pae_off (u16) which is 11:5 of the cr3
when the guest is PAE paging. It needs only less than 10 lines
of code.

Thanks.
Lai

Paolo Bonzini April 14, 2022, 9:08 a.m. UTC | #3

On 4/14/22 11:07, Lai Jiangshan wrote:
>> I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
>> for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
>> reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
>> gfn calculation will drop bits 11:5.
> 
> I forgot about it.


Isn't the pae_root always rebuilt by

         if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
                 kvm_mmu_free_roots(vcpu->kvm, mmu, KVM_MMU_ROOT_CURRENT);

in load_pdptrs?  I think reuse cannot happen.

Paolo

Lai Jiangshan April 14, 2022, 9:32 a.m. UTC | #4

On Thu, Apr 14, 2022 at 5:08 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 4/14/22 11:07, Lai Jiangshan wrote:
> >> I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
> >> for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
> >> reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
> >> gfn calculation will drop bits 11:5.
> >
> > I forgot about it.
>
>
> Isn't the pae_root always rebuilt by
>
>          if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
>                  kvm_mmu_free_roots(vcpu->kvm, mmu, KVM_MMU_ROOT_CURRENT);
>
> in load_pdptrs?  I think reuse cannot happen.
>

In this patchset, root sp can be reused if it is found from the hash,
including new pae root.

All new kinds of sp added in this patchset are in the hash too.

No more special root pages.

kvm_mmu_free_roots() can not free those new types of sp if they are still
valid.  And different vcpu can use the same pae root sp if the guest cr3
of the vcpus are the same.

And new pae root can be put in prev_root too (not implemented yet)
because they are not too special anymore.  As long as sp->gfn, sp->pae_off,
sp->role are matched, they can be reused.

Paolo Bonzini April 14, 2022, 10:04 a.m. UTC | #5

On 4/14/22 11:32, Lai Jiangshan wrote:
> kvm_mmu_free_roots() can not free those new types of sp if they are still
> valid.  And different vcpu can use the same pae root sp if the guest cr3
> of the vcpus are the same.

Right, but then load_pdptrs only needs to zap the page before (or 
instead of) calling kvm_mmu_free_roots().

Paolo

> And new pae root can be put in prev_root too (not implemented yet)
> because they are not too special anymore.  As long as sp->gfn, sp->pae_off,
> sp->role are matched, they can be reused.

Lai Jiangshan April 14, 2022, 11:06 a.m. UTC | #6

On Thu, Apr 14, 2022 at 6:04 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 4/14/22 11:32, Lai Jiangshan wrote:
> > kvm_mmu_free_roots() can not free those new types of sp if they are still
> > valid.  And different vcpu can use the same pae root sp if the guest cr3
> > of the vcpus are the same.
>
> Right, but then load_pdptrs only needs to zap the page before (or
> instead of) calling kvm_mmu_free_roots().
>

Guest PAE page is write-protected instead now (see patch4) and
kvm_mmu_pte_write() needs to handle this special write operation
with respect to sp->pae_off (todo).
And load_pdptrs() doesn't need to check if the pdptrs are changed.

The semantics will be changed. When the guest updates its PAE root,
the hwTLB will not be updated/flushed immediately until some change
to CRx, but after this change, it will be flushed immediately.

Could we fix 5-level NPT L0 for 4-level NPT L1 only first? it is
a real bug.  I separated it out when I tried to implement one-off
shadow pages.

Lai Jiangshan April 14, 2022, 1:35 p.m. UTC | #7

On Thu, Apr 14, 2022 at 5:32 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:

>
> All new kinds of sp added in this patchset are in the hash too.
>

I think role.guest_pae_root is needed to distinguish it from
a sp for a level-3 guest page in a 4-level pagetable.

Or just role.guest_root_level(or role.root_level) and it can replace
role.passthrough_depth and role.guest_pae_root and role.pae_root.

role.pae_root will be

(role.root_level == 3 || role.root_level == 2) && role.level == 3 &&
(host is 32bit || !tdp_enabled)

Paolo Bonzini April 14, 2022, 2:12 p.m. UTC | #8

On 4/14/22 13:06, Lai Jiangshan wrote:
>> Right, but then load_pdptrs only needs to zap the page before (or
>> instead of) calling kvm_mmu_free_roots().
>>
> 
> Guest PAE page is write-protected instead now (see patch4) and
> kvm_mmu_pte_write() needs to handle this special write operation
> with respect to sp->pae_off (todo).
> And load_pdptrs() doesn't need to check if the pdptrs are changed.

Write-protecting the PDPTR page is unnecessary, the PDPTRs cannot change 
without another CR3.  That should be easy to do in account_shadowed and 
unaccount_shadowed

> I think role.guest_pae_root is needed to distinguish it from
> a sp for a level-3 guest page in a 4-level pagetable.
>
> Or just role.guest_root_level(or role.root_level) and it can replace
> role.passthrough_depth and role.guest_pae_root and role.pae_root.

Yes, I agree.  Though this would also get change patch 1 substantially, 
so I'll wait for you to respin.

Paolo

Sean Christopherson April 14, 2022, 2:42 p.m. UTC | #9

On Thu, Apr 14, 2022, Paolo Bonzini wrote:
> On 4/14/22 13:06, Lai Jiangshan wrote:
> > > Right, but then load_pdptrs only needs to zap the page before (or
> > > instead of) calling kvm_mmu_free_roots().
> > > 
> > 
> > Guest PAE page is write-protected instead now (see patch4) and
> > kvm_mmu_pte_write() needs to handle this special write operation
> > with respect to sp->pae_off (todo).
> > And load_pdptrs() doesn't need to check if the pdptrs are changed.
> 
> Write-protecting the PDPTR page is unnecessary, the PDPTRs cannot change
> without another CR3.  That should be easy to do in account_shadowed and
> unaccount_shadowed

Technically that's not true under SVM?

  Under SVM, however, when the processor is in guest mode with PAE enabled, the
  guest PDPT entries are not cached or validated at this point, but instead are
  loaded and checked on demand in the normal course of address translation, just
  like page directory and page table entries

Sean Christopherson April 14, 2022, 2:52 p.m. UTC | #10

On Thu, Apr 14, 2022, Lai Jiangshan wrote:
> On Wed, Apr 13, 2022 at 5:14 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > >
> > > Currently pae_root is special root page, this patch adds facility to
> > > allow using kvm_mmu_get_page() to allocate pae_root shadow page.
> >
> > I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
> > for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
> > reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
> > gfn calculation will drop bits 11:5.
> 
> I forgot about it.
> 
> >
> > Handling this as a one-off is probably easier.  For TDP, only 32-bit KVM with NPT
> > benefits from reusing roots, IMO and shaving a few pages in that case is not worth
> > the complexity.
> >
> 
> I liked the one-off idea yesterday and started trying it.
> 
> But things were not going as smoothly as I thought.  There are too
> many corner cases to cover.  Maybe I don't get what you envisioned.

Hmm, I believe I was thinking that each vCPU could have a pre-allocated pae_root
shadow page, i.e. keep pae_root, but make it

	struct kvm_mmu_page pae_root;

The alloc/free paths would still need special handling, but at least in theory,
all other code that expects root a shadow page will Just Work.

[RFC,V3,3/4] KVM: X86: Alloc role.pae_root shadow page

Commit Message

Comments

Patch