[RFC,11/19] KVM: x86/mmu: Factor shadow_zero_check out of make_spte

Message ID 20211110223010.1392399-12-bgardon@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Optimize disabling dirty logging

Commit Message

Ben Gardon Nov. 10, 2021, 10:30 p.m. UTC
In the interest of developing a version of make_spte that can function
without a vCPU pointer, factor out the shadow_zero_check to be an
additional argument to the function.

No functional change intended.


Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/spte.c | 11 +++++++----
 arch/x86/kvm/mmu/spte.h |  3 ++-
 2 files changed, 9 insertions(+), 5 deletions(-)

Comments

Paolo Bonzini Nov. 10, 2021, 10:44 p.m. UTC | #1
On 11/10/21 23:30, Ben Gardon wrote:
> -	WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> +	WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
>   		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> -		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> +		  get_rsvd_bits(shadow_zero_check, spte, level));

Hmm, there is a deeper issue here, in that when using EPT/NPT (on either 
the legacy aka shadow or the TDP MMU) large parts of vcpu->arch.mmu are 
really the same for all vCPUs.  The only things that vary are the
parts that actually depend on the guest's paging mode---the extended
role, the reserved bits, etc.  Those are needed by the emulator, but 
don't really belong in vcpu->arch.mmu when EPT/NPT is in use.

I wonder if there's room for splitting kvm_mmu in two parts, such as 
kvm_mmu and kvm_guest_paging_context, and possibly change the walk_mmu 
pointer into a pointer to kvm_guest_paging_context.  This way the 
EPT/NPT MMU (again either shadow or TDP) can be moved to kvm->arch.  It 
should simplify this series and also David's work on eager page splitting.
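
For concreteness, a rough sketch of such a split (names and field
placement are hypothetical, drawn only from the fields discussed later
in this thread):

	/* Hypothetical: per-VM state, identical for all vCPUs under EPT/NPT. */
	struct kvm_mmu {
		hpa_t root_hpa;
		struct rsvd_bits_validate shadow_zero_check;
		int (*page_fault)(struct kvm_vcpu *vcpu,
				  struct kvm_page_fault *fault);
	};

	/* Hypothetical: per-vCPU guest paging state, kept for the emulator. */
	struct kvm_guest_paging_context {
		union kvm_mmu_role mmu_role;
		struct rsvd_bits_validate guest_rsvd_check;
		u8 permissions[16];
		u32 pkru_mask;
		gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gpa_t gva_or_gpa,
				    u32 access, struct x86_exception *exception);
	};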

I'm not asking you to do this, of course, but perhaps I can trigger 
Sean's itch to refactor stuff. :)

Paolo
Ben Gardon Nov. 10, 2021, 11:49 p.m. UTC | #2
On Wed, Nov 10, 2021 at 2:45 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/10/21 23:30, Ben Gardon wrote:
> > -     WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> > +     WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
> >                 "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> > -               get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> > +               get_rsvd_bits(shadow_zero_check, spte, level));
>
> Hmm, there is a deeper issue here, in that when using EPT/NPT (on either
> the legacy aka shadow or the TDP MMU) large parts of vcpu->arch.mmu are
> really the same for all vCPUs.  The only things that vary are the
> parts that actually depend on the guest's paging mode---the extended
> role, the reserved bits, etc.  Those are needed by the emulator, but
> don't really belong in vcpu->arch.mmu when EPT/NPT is in use.
>
> I wonder if there's room for splitting kvm_mmu in two parts, such as
> kvm_mmu and kvm_guest_paging_context, and possibly change the walk_mmu
> pointer into a pointer to kvm_guest_paging_context.  This way the
> EPT/NPT MMU (again either shadow or TDP) can be moved to kvm->arch.  It
> should simplify this series and also David's work on eager page splitting.
>
> I'm not asking you to do this, of course, but perhaps I can trigger
> Sean's itch to refactor stuff. :)
>
> Paolo
>

I think that's a great idea. I'm frequently confused as to why the
struct kvm_mmu is a per-vcpu construct as opposed to being VM-global.
Moving part of the struct to be a member for struct kvm would also
open the door to formalizing the MMU interface a little better and
perhaps even reveal more MMU code that can be consolidated across
architectures.
Sean Christopherson Nov. 11, 2021, 1:18 a.m. UTC | #3
On Wed, Nov 10, 2021, Ben Gardon wrote:
> On Wed, Nov 10, 2021 at 2:45 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 11/10/21 23:30, Ben Gardon wrote:
> > > -     WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> > > +     WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
> > >                 "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> > > -               get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> > > +               get_rsvd_bits(shadow_zero_check, spte, level));
> >
> > Hmm, there is a deeper issue here, in that when using EPT/NPT (on either
> > the legacy aka shadow or the TDP MMU) large parts of vcpu->arch.mmu are
> > really the same for all vCPUs.  The only things that vary are the
> > parts that actually depend on the guest's paging mode---the extended
> > role, the reserved bits, etc.  Those are needed by the emulator, but
> > don't really belong in vcpu->arch.mmu when EPT/NPT is in use.
> >
> > I wonder if there's room for splitting kvm_mmu in two parts, such as
> > kvm_mmu and kvm_guest_paging_context, and possibly change the walk_mmu
> > pointer into a pointer to kvm_guest_paging_context.  This way the
> > EPT/NPT MMU (again either shadow or TDP) can be moved to kvm->arch.  It
> > should simplify this series and also David's work on eager page splitting.
> >
> > I'm not asking you to do this, of course, but perhaps I can trigger
> > Sean's itch to refactor stuff. :)
> >
> > Paolo
> >
> 
> I think that's a great idea. I'm frequently confused as to why the
> struct kvm_mmu is a per-vcpu construct as opposed to being VM-global.
> Moving part of the struct to be a member for struct kvm would also
> open the door to formalizing the MMU interface a little better and
> perhaps even reveal more MMU code that can be consolidated across
> architectures.

But what would you actually move?  Even shadow_zero_check barely squeaks by,
e.g. if NX is ever used for NPT, then maybe it stops being a per-VM setting.

Going through the fields...

These are all related to guest context:

	unsigned long (*get_guest_pgd)(struct kvm_vcpu *vcpu);
	u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
	int (*page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
	void (*inject_page_fault)(struct kvm_vcpu *vcpu,
				  struct x86_exception *fault);
	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gpa_t gva_or_gpa,
			    u32 access, struct x86_exception *exception);
	gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
			       struct x86_exception *exception);
	int (*sync_page)(struct kvm_vcpu *vcpu,
			 struct kvm_mmu_page *sp);
	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
	union kvm_mmu_role mmu_role;
	u8 root_level;
	u8 permissions[16];
	u32 pkru_mask;
	struct rsvd_bits_validate guest_rsvd_check;
	u64 pdptrs[4];
	gpa_t root_pgd;

One field, ept_ad, can be straight deleted as it's redundant with respect to
the above mmu_role.ad_disabled.

	u8 ept_ad;

Ditto for direct_map flag (mmu_role.direct) and shadow_root_level (mmu_role.level).
I haven't bothered to yank those because they have a lot of touchpoints.

	bool direct_map;
	u8 shadow_root_level;

The prev_roots could be dropped if TDP roots were tracked per-VM, but we'd still
want an equivalent for !TDP and nTDP MMUs.

	struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];

shadow_zero_check can be made per-VM if all vCPUs are required to have the same
cpuid.MAXPHYADDR or if we remove the (IMO) pointless 5-level vs. 4-level behavior,
which by-the-by, has my vote since we could make shadow_zero_check _global_, not
just per-VM, and everything I've heard is that the extra level has no measurable
performance overhead.

	struct rsvd_bits_validate shadow_zero_check;
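
Concretely, "global" here would mean a single module-level instance,
computed once and shared by every non-nested TDP MMU, e.g. (hypothetical
sketch):

	/*
	 * Hypothetical: filled in once at module init against host
	 * MAXPHYADDR, with all five levels populated, instead of a
	 * per-MMU copy recomputed on every MMU reset.
	 */
	static struct rsvd_bits_validate tdp_shadow_zero_check __read_mostly;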

And that leaves us with:
	hpa_t root_hpa;

	u64 *pae_root;
	u64 *pml4_root;
	u64 *pml5_root;

Of those, _none_ of them can be per-VM, because they are all nothing more than
shadow pages, and thus cannot be per-VM unless there is exactly one set of TDP
page tables for the guest.  Even if/when we strip the unnecessary role bits from
these for TDP (on my todo list), we still need up to three sets of page tables:

	1. Normal
	2. SMM
	3. Guest (if L1 doesn't use TDP)

So I suppose we could refactor KVM to explicitly track its three possible TDP
roots, but I don't think it buys us anything and would complicate supporting
!TDP as well as nTDP.
Sean Christopherson Nov. 11, 2021, 1:44 a.m. UTC | #4
On Thu, Nov 11, 2021, Sean Christopherson wrote:
> These are all related to guest context:
> 	int (*page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> 	void (*inject_page_fault)(struct kvm_vcpu *vcpu,
> 				  struct x86_exception *fault);

That's incorrect, page_fault() and inject_page_fault() could be per-VM.  Neither
is particularly interesting though.
Paolo Bonzini Nov. 11, 2021, 7:06 a.m. UTC | #5
On 11/11/21 02:18, Sean Christopherson wrote:
> But what would you actually move?  Even shadow_zero_check barely squeaks by,
> e.g. if NX is ever used for NPT, then maybe it stops being a per-VM setting.

Hmm, I think it would still be per-VM, just like 32-bit shadow page tables
are always built for EFER.NXE=CR4.PAE=1.  Anyway, the rough sketch is to have
three structs:

* struct kvm_mmu_kind has the function pointers and the state that is
needed to operate on page tables

* struct kvm_mmu has the function pointers and the state that is
needed while the vCPU runs, including the role

* struct kvm_paging_context has the stuff related to emulation;
shadow page tables of course needs it but EPT/NPT do not (with
either the old or the new MMU)

So you'd have a "struct kvm_mmu_kind direct_mmu" in struct kvm_arch (for
either legacy EPT/NPT or the new MMU), and

	struct kvm_mmu_kind shadow_mmu;
	struct kvm_mmu root_mmu;		/* either TDP or shadow */
	struct kvm_mmu tdp12_mmu;		/* always shadow */
	struct kvm_mmu *mmu;			/* either &kvm->direct_mmu or &vcpu->shadow_mmu */
	struct kvm_paging_context root_walk;	/* maybe unified with walk01 below? dunno yet */
	struct kvm_paging_context walk01;
	struct kvm_paging_context walk12;
	struct kvm_paging_context *walk;	/* either &vcpu->root_walk or &vcpu->walk12 */

in struct kvm_vcpu_arch.  struct kvm_mmu* has a pointer to
struct kvm_mmu_kind*; however, if an spte.c function does not need
the data in struct kvm_mmu*, it can take a struct kvm_mmu_kind*
and it won't need a vCPU.  Likewise the TDP MMU knows its kvm_mmu_kind
is always in &kvm->direct_mmu so it can take a struct kvm* if the struct
kvm_mmu* is not needed.

The first part of the refactoring would be to kill the nested_mmu
and walk_mmu, replacing them by &vcpu->walk12 and vcpu->walk
respectively.  The half-useless nested_mmu has always bothered me;
I was going to play with it anyway because I want to remove the
kvm_mmu_reset_context from CR0.WP writes, so I'll see if I get
something useful out of it.

Paolo
Sean Christopherson Nov. 18, 2021, 2:05 a.m. UTC | #6
On Wed, Nov 10, 2021, Ben Gardon wrote:
> In the interest of developing a version of make_spte that can function
> without a vCPU pointer, factor out the shadow_zero_check to be an
> additional argument to the function.
> 
> No functional change intended.
> 
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/spte.c | 11 +++++++----
>  arch/x86/kvm/mmu/spte.h |  3 ++-
>  2 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index b7271daa06c5..d3b059e96c6e 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -93,7 +93,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       struct kvm_memory_slot *slot, unsigned int pte_access,
>  	       gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
>  	       bool can_unsync, bool host_writable, bool ad_need_write_protect,
> -	       u64 mt_mask, u64 *new_spte)
> +	       u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,

Ugh, so I had a big email written about how I think we should add a module param
to control 4-level vs. 5-level for all TDP pages, but then I realized it wouldn't
work for nested EPT because that follows the root level used by L1.  We could
still make a global non_nested_tdp_shadow_zero_check or whatever, but then make_spte()
would have to do some work to find the right rsvd_bits_validate, and the end result
would likely be a mess.

One idea to avoid exploding make_spte() would be to add a backpointer to the MMU
in kvm_mmu_page.  I don't love the idea, but I also don't love passing in rsvd_bits_validate.
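
The backpointer idea would amount to something like this (hypothetical
sketch, not a proposed patch):

	struct kvm_mmu_page {
		...
		/* Hypothetical: back-reference to the MMU that owns this SP. */
		struct kvm_mmu *mmu;
	};

with make_spte() then able to find the reserved bits without a vCPU via
&sp->mmu->shadow_zero_check.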
Sean Christopherson Nov. 18, 2021, 3:29 a.m. UTC | #7
On Thu, Nov 18, 2021, Sean Christopherson wrote:
> On Wed, Nov 10, 2021, Ben Gardon wrote:
> > In the interest of developing a version of make_spte that can function
> > without a vCPU pointer, factor out the shadow_zero_check to be an
> > additional argument to the function.
> > 
> > No functional change intended.
> > 
> > 
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/spte.c | 11 +++++++----
> >  arch/x86/kvm/mmu/spte.h |  3 ++-
> >  2 files changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index b7271daa06c5..d3b059e96c6e 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -93,7 +93,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >  	       struct kvm_memory_slot *slot, unsigned int pte_access,
> >  	       gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
> >  	       bool can_unsync, bool host_writable, bool ad_need_write_protect,
> > -	       u64 mt_mask, u64 *new_spte)
> > +	       u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
> 
> Ugh, so I had a big email written about how I think we should add a module param
> to control 4-level vs. 5-level for all TDP pages, but then I realized it wouldn't
> work for nested EPT because that follows the root level used by L1.  We could
> still make a global non_nested_tdp_shadow_zero_check or whatever, but then make_spte()
> would have to do some work to find the right rsvd_bits_validate, and the end result
> would likely be a mess.
> 
> One idea to avoid exploding make_spte() would be to add a backpointer to the MMU
> in kvm_mmu_page.  I don't love the idea, but I also don't love passing in rsvd_bits_validate.

Another idea.  The only difference between 5-level and 4-level is that 5-level
fills in index [4], and I'm pretty sure 4-level doesn't touch that index.  For
PAE NPT (32-bit SVM), the shadow root level will never change, so that's not an issue.

Nested NPT is the only case where anything for an EPT/NPT MMU can change, because
that follows EFER.NX.

In other words, the non-nested TDP reserved bits don't need to be recalculated
regardless of level, they can just fill in 5-level and leave it be.

E.g. something like the below.  The sp->role.direct check could be removed if we
forced EFER.NX for nested NPT.

It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
thought, but at minimum it means there's no need to recalc the reserved bits.

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 84e64dbdd89e..05db9b89dc53 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -95,10 +95,18 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
               u64 old_spte, bool prefetch, bool can_unsync,
               bool host_writable, u64 *new_spte)
 {
+       struct rsvd_bits_validate *rsvd_check;
        int level = sp->role.level;
        u64 spte = SPTE_MMU_PRESENT_MASK;
        bool wrprot = false;

+       if (vcpu) {
+               rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
+       } else {
+               WARN_ON_ONCE(!tdp_enabled || !sp->role.direct);
+               rsvd_check = tdp_shadow_rsvd_bits;
+       }
+
        if (sp->role.ad_disabled)
                spte |= SPTE_TDP_AD_DISABLED_MASK;
        else if (kvm_mmu_page_ad_need_write_protect(sp))
@@ -177,9 +185,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
        if (prefetch)
                spte = mark_spte_for_access_track(spte);

-       WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+       WARN_ONCE(is_rsvd_spte(rsvd_check, spte, level),
                  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
-                 get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+                 get_rsvd_bits(rsvd_check, spte, level));

        if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
                /* Enforced by kvm_mmu_hugepage_adjust. */
Sean Christopherson Nov. 18, 2021, 4:37 p.m. UTC | #8
On Thu, Nov 18, 2021, Sean Christopherson wrote:
> Another idea.  The only difference between 5-level and 4-level is that 5-level
> fills in index [4], and I'm pretty sure 4-level doesn't touch that index.  For
> PAE NPT (32-bit SVM), the shadow root level will never change, so that's not an issue.
> 
> Nested NPT is the only case where anything for an EPT/NPT MMU can change, because
> that follows EFER.NX.
> 
> In other words, the non-nested TDP reserved bits don't need to be recalculated
> regardless of level, they can just fill in 5-level and leave it be.
> 
> E.g. something like the below.  The sp->role.direct check could be removed if we
> forced EFER.NX for nested NPT.
> 
> It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
> thought, but at minimum it means there's no need to recalc the reserved bits.

Ok, I think my final vote is to have the reserved bits passed in, but with the
non-nested TDP reserved bits being computed at MMU init.

I would also prefer to keep the existing make_spte() name so that there's no churn
in those call sites, and to make the relationship between the wrapper, make_spte(),
and the "real" helper, __make_spte(), more obvious and aligned with the usual
kernel style.

So with the kvm_vcpu_ad_need_write_protect() change and my proposed hack-a-fix for
kvm_x86_get_mt_mask(), the end result would look like:

bool __make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
		 struct kvm_memory_slot *slot, unsigned int pte_access,
		 gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
		 bool can_unsync, bool host_writable, u64 *new_spte,
		 struct rsvd_bits_validate *shadow_rsvd_bits)
{
	int level = sp->role.level;
	u64 spte = SPTE_MMU_PRESENT_MASK;
	bool wrprot = false;

	if (sp->role.ad_disabled)
		spte |= SPTE_TDP_AD_DISABLED_MASK;
	else if (kvm_mmu_page_ad_need_write_protect(sp))
		spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK;

	/*
	 * For the EPT case, shadow_present_mask is 0 if hardware
	 * supports exec-only page table entries.  In that case,
	 * ACC_USER_MASK and shadow_user_mask are used to represent
	 * read access.  See FNAME(gpte_access) in paging_tmpl.h.
	 */
	spte |= shadow_present_mask;
	if (!prefetch)
		spte |= spte_shadow_accessed_mask(spte);

	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
	    is_nx_huge_page_enabled()) {
		pte_access &= ~ACC_EXEC_MASK;
	}

	if (pte_access & ACC_EXEC_MASK)
		spte |= shadow_x_mask;
	else
		spte |= shadow_nx_mask;

	if (pte_access & ACC_USER_MASK)
		spte |= shadow_user_mask;

	if (level > PG_LEVEL_4K)
		spte |= PT_PAGE_SIZE_MASK;
	if (tdp_enabled)
		spte |= static_call(kvm_x86_get_mt_mask)(kvm, gfn,
			kvm_is_mmio_pfn(pfn));

	if (host_writable)
		spte |= shadow_host_writable_mask;
	else
		pte_access &= ~ACC_WRITE_MASK;

	if (!kvm_is_mmio_pfn(pfn))
		spte |= shadow_me_mask;

	spte |= (u64)pfn << PAGE_SHIFT;

	if (pte_access & ACC_WRITE_MASK) {
		spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask;

		/*
		 * Optimization: for pte sync, if spte was writable the hash
		 * lookup is unnecessary (and expensive). Write protection
		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
		 * Same reasoning can be applied to dirty page accounting.
		 */
		if (is_writable_pte(old_spte))
			goto out;

		/*
		 * Unsync shadow pages that are reachable by the new, writable
		 * SPTE.  Write-protect the SPTE if the page can't be unsync'd,
		 * e.g. it's write-tracked (upper-level SPs) or has one or more
		 * shadow pages and unsync'ing pages is not allowed.
		 */
		if (mmu_try_to_unsync_pages(kvm, slot, gfn, can_unsync, prefetch)) {
			pgprintk("%s: found shadow page for %llx, marking ro\n",
				 __func__, gfn);
			wrprot = true;
			pte_access &= ~ACC_WRITE_MASK;
			spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
		}
	}

	if (pte_access & ACC_WRITE_MASK)
		spte |= spte_shadow_dirty_mask(spte);

out:
	if (prefetch)
		spte = mark_spte_for_access_track(spte);

	WARN_ONCE(is_rsvd_spte(shadow_rsvd_bits, spte, level),
		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
		  get_rsvd_bits(shadow_rsvd_bits, spte, level));

	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
		/* Enforced by kvm_mmu_hugepage_adjust. */
		WARN_ON(level > PG_LEVEL_4K);
		mark_page_dirty_in_slot(kvm, slot, gfn);
	}

	*new_spte = spte;
	return wrprot;
}

bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
	       struct kvm_memory_slot *slot,
	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
	       u64 old_spte, bool prefetch, bool can_unsync,
	       bool host_writable, u64 *new_spte)
{
	return __make_spte(vcpu->kvm, sp, slot, pte_access, gfn, pfn, old_spte,
			   prefetch, can_unsync, host_writable, new_spte,
			   &vcpu->arch.mmu->shadow_zero_check);
}
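
A vCPU-less caller, e.g. the TDP MMU path that motivates this series,
could then do something like (hypothetical sketch; the per-VM field name
is made up, assuming the non-nested TDP reserved bits are computed once
at MMU init):

	wrprot = __make_spte(kvm, sp, slot, pte_access, gfn, pfn, old_spte,
			     prefetch, can_unsync, host_writable, &new_spte,
			     &kvm->arch.tdp_shadow_zero_check);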
Paolo Bonzini Nov. 18, 2021, 5:19 p.m. UTC | #9
On 11/18/21 17:37, Sean Christopherson wrote:
>> It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
>> thought, but at minimum it means there's no need to recalc the reserved bits.
>
> Ok, I think my final vote is to have the reserved bits passed in, but with the
> non-nested TDP reserved bits being computed at MMU init.

Yes, and that's also where I was getting with the idea of moving part of 
the "direct" MMU (man, naming these things is so hard) to struct kvm: 
split the per-vCPU state from the constant one and initialize the latter 
just once.  Though perhaps I was putting the cart slightly before the horse.

On the topic of naming, we have a lot of things to name:

- the two MMU codebases: you Googlers are trying to grandfather "legacy" 
and "TDP" into upstream, but that's not a great name because the former 
is used also when shadowing EPT/NPT.  I'm thinking of standardizing on 
"shadow" and "TDP" (it's not perfect because of the 32-bit and tdp_mmu=0 
cases, but it's a start).  Maybe even split parts of mmu.c out into 
shadow_mmu.c.

- the two walkers (I'm quite convinced of splitting that part out of 
struct kvm_mmu and getting rid of walk_mmu/nested_mmu): that's easy, it 
can be walk01 and walk12 with "walk" pointing to one of them

- the two MMUs: with nested_mmu gone, root_mmu and guest_mmu are much 
less confusing and we can keep those names.

Paolo
Sean Christopherson Nov. 18, 2021, 6:02 p.m. UTC | #10
On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> On 11/18/21 17:37, Sean Christopherson wrote:
> > > It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
> > > thought, but at minimum it means there's no need to recalc the reserved bits.
> > 
> > Ok, I think my final vote is to have the reserved bits passed in, but with the
> > non-nested TDP reserved bits being computed at MMU init.
> 
> Yes, and that's also where I was getting with the idea of moving part of the
> "direct" MMU (man, naming these things is so hard) to struct kvm: split the
> per-vCPU state from the constant one and initialize the latter just once.
> Though perhaps I was putting the cart slightly before the horse.
> 
> On the topic of naming, we have a lot of things to name:
> 
> - the two MMU codebases: you Googlers are trying to grandfather "legacy" and
> "TDP" into upstream

Heh, I think that's like 99.9% me.

> but that's not a great name because the former is used also when shadowing
> EPT/NPT.  I'm thinking of standardizing on "shadow" and "TDP" (it's not
> perfect because of the 32-bit and tdp_mmu=0 cases, but it's a start).  Maybe
> even split parts of mmu.c out into shadow_mmu.c.

But shadow is flat out wrong until EPT and NPT support is ripped out of the "legacy"
MMU.

> - the two walkers (I'm quite convinced of splitting that part out of struct
> kvm_mmu and getting rid of walk_mmu/nested_mmu): that's easy, it can be
> walk01 and walk12 with "walk" pointing to one of them

I am all in favor of walk01 and walk12, the guest_mmu vs. nested_mmu confusion
is painful.

> - the two MMUs: with nested_mmu gone, root_mmu and guest_mmu are much less
> confusing and we can keep those names.

I would prefer root_mmu and nested_tdp_mmu.  guest_mmu is misleading because it's
not used for all cases of sp->role.guest_mode=1, i.e. when L1 is not using TDP
then guest_mode=1 but KVM isn't using guest_mmu.
Paolo Bonzini Nov. 18, 2021, 6:07 p.m. UTC | #11
On 11/18/21 19:02, Sean Christopherson wrote:
>> but that's not a great name because the former is used also when shadowing
>> EPT/NPT.  I'm thinking of standardizing on "shadow" and "TDP" (it's not
>> perfect because of the 32-bit and tdp_mmu=0 cases, but it's a start).  Maybe
>> even split parts of mmu.c out into shadow_mmu.c.
> But shadow is flat out wrong until EPT and NPT support is ripped out of the "legacy"
> MMU.

Yeah, that's true.  "full" MMU? :)

>> - the two walkers (I'm quite convinced of splitting that part out of struct
>> kvm_mmu and getting rid of walk_mmu/nested_mmu): that's easy, it can be
>> walk01 and walk12 with "walk" pointing to one of them
>
> I am all in favor of walk01 and walk12, the guest_mmu vs. nested_mmu confusion
> is painful.
> 
>> - the two MMUs: with nested_mmu gone, root_mmu and guest_mmu are much less
>> confusing and we can keep those names.
>
> I would prefer root_mmu and nested_tdp_mmu.  guest_mmu is misleading because it's
> not used for all cases of sp->role.guest_mode=1, i.e. when L1 is not using TDP
> then guest_mode=1 but KVM isn't using guest_mmu.

Ok, that sounds good too.

Paolo
Sean Christopherson Nov. 18, 2021, 6:14 p.m. UTC | #12
On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> On 11/18/21 19:02, Sean Christopherson wrote:
> > > but that's not a great name because the former is used also when shadowing
> > > EPT/NPT.  I'm thinking of standardizing on "shadow" and "TDP" (it's not
> > > perfect because of the 32-bit and tdp_mmu=0 cases, but it's a start).  Maybe
> > > even split parts of mmu.c out into shadow_mmu.c.
> > But shadow is flat out wrong until EPT and NPT support is ripped out of the "legacy"
> > MMU.
> 
> Yeah, that's true.  "full" MMU? :)

Or we could just rip out non-nested TDP support from the legacy MMU and call it
the shadow MMU :-)

Patch

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b7271daa06c5..d3b059e96c6e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -93,7 +93,8 @@  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       struct kvm_memory_slot *slot, unsigned int pte_access,
 	       gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
 	       bool can_unsync, bool host_writable, bool ad_need_write_protect,
-	       u64 mt_mask, u64 *new_spte)
+	       u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
+	       u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -176,9 +177,9 @@  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (prefetch)
 		spte = mark_spte_for_access_track(spte);
 
-	WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+	WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
 		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
-		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+		  get_rsvd_bits(shadow_zero_check, spte, level));
 
 	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
@@ -198,10 +199,12 @@  bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	bool ad_need_write_protect = kvm_vcpu_ad_need_write_protect(vcpu);
 	u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
 						       kvm_is_mmio_pfn(pfn));
+	struct rsvd_bits_validate *shadow_zero_check = &vcpu->arch.mmu->shadow_zero_check;
 
 	return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
 			 prefetch, can_unsync, host_writable,
-			 ad_need_write_protect, mt_mask, new_spte);
+			 ad_need_write_protect, mt_mask, shadow_zero_check,
+			 new_spte);
 
 }
 
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e739f2ebf844..6134a10487c4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -333,7 +333,8 @@  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       struct kvm_memory_slot *slot, unsigned int pte_access,
 	       gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
 	       bool can_unsync, bool host_writable, bool ad_need_write_protect,
-	       u64 mt_mask, u64 *new_spte);
+	       u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
+	       u64 *new_spte);
 bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		    struct kvm_memory_slot *slot,
 		    unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,