[RFC,v2,41/69] KVM: x86: Add infrastructure for stolen GPA bits

Message ID c958a131ded780808a687b0f25c02127ca14418a.1625186503.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Series KVM: X86: TDX support

Commit Message

Isaku Yamahata July 2, 2021, 10:04 p.m. UTC
From: Rick Edgecombe <rick.p.edgecombe@intel.com>

Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
perspective) to a single GPA (from a memslot perspective). GPA aliasing
will be used to repurpose GPA bits as attribute bits, e.g. to expose an
execute-only permission bit to the guest. To keep the implementation
simple (relatively speaking), GPA aliasing is only supported via TDP.

Today KVM assumes two things that are broken by GPA aliasing.
  1. GPAs coming from hardware can be simply shifted to get the GFNs.
  2. GPA bits 51:MAXPHYADDR are reserved to zero.

With GPA aliasing, translating a GPA to GFN requires masking off the
repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.

To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
that is, bits stolen from the GPA to act as new virtualized attribute
bits. A bit in the mask will cause the MMU code to create aliases of the
GPA. It can also be used to extract the GFN from a GPA taken from a TDP
fault.
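
For example (illustrative only, not part of this patch): a feature that
repurposes a single high GPA bit, say as an execute-only attribute, would
report that bit via the per-VM mask helper added to mmu.h below, and every
GPA-to-GFN conversion in the MMU would then strip it:

	/* Hypothetical sketch; gfn_xo_mask and the chosen bit are made up. */
	static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
	{
		return kvm->arch.gfn_xo_mask;	/* e.g. one bit above the real GFN range */
	}

	/* ... and on a TDP fault: */
	gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);	/* stolen bit stripped */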

To handle case (1) from above, retain any stolen bits when passing a GPA
in KVM's MMU code, but strip them when converting to a GFN so that the
GFN contains only the "real" GFN, i.e. never has repurposed bits set.

GFNs (without stolen bits) continue to be used to:
	-Specify physical memory by userspace via memslots
	-Map GPAs to TDP PTEs via RMAP
	-Specify dirty tracking and write protection
	-Look up MTRR types
	-Inject async page faults

Since there are now multiple aliases for the same GPA, when userspace
memory backing the memslots is paged out, both aliases need to be
modified. Fortunately this happens automatically: because the rmap
already supports multiple mappings for the same GFN (for shadow paging),
adding/removing each alias PTE under its GFN means kvm_handle_hva()
based operations are applied to both aliases.

In the case of the rmap being removed in the future, the needed
information could be recovered by iterating over the stolen bits and
walking the TDP page tables.

For TLB flushes that are address based, make sure to flush both aliases
in the stolen bits case.

Only support stolen bits in 64-bit guest paging modes (long, PAE).
Features that use this infrastructure should restrict the stolen bits to
exclude the other paging modes. Don't support stolen bits for shadow EPT.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h              | 26 ++++++++++
 arch/x86/kvm/mmu/mmu.c          | 86 ++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  | 25 ++++++----
 4 files changed, 101 insertions(+), 37 deletions(-)

Comments

Paolo Bonzini July 6, 2021, 2:54 p.m. UTC | #1
On 03/07/21 00:04, isaku.yamahata@intel.com wrote:
> From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> perspective) to a single GPA (from a memslot perspective). GPA alising
> will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> execute-only permission bit to the guest. To keep the implementation
> simple (relatively speaking), GPA aliasing is only supported via TDP.
> 
> Today KVM assumes two things that are broken by GPA aliasing.
>    1. GPAs coming from hardware can be simply shifted to get the GFNs.
>    2. GPA bits 51:MAXPHYADDR are reserved to zero.
> 
> With GPA aliasing, translating a GPA to GFN requires masking off the
> repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
> 
> To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> that is, bits stolen from the GPA to act as new virtualized attribute
> bits. A bit in the mask will cause the MMU code to create aliases of the
> GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> fault.
> 
> To handle case (1) from above, retain any stolen bits when passing a GPA
> in KVM's MMU code, but strip them when converting to a GFN so that the
> GFN contains only the "real" GFN, i.e. never has repurposed bits set.
> 
> GFNs (without stolen bits) continue to be used to:
> 	-Specify physical memory by userspace via memslots
> 	-Map GPAs to TDP PTEs via RMAP
> 	-Specify dirty tracking and write protection
> 	-Look up MTRR types
> 	-Inject async page faults
> 
> Since there are now multiple aliases for the same aliased GPA, when
> userspace memory backing the memslots is paged out, both aliases need to be
> modified. Fortunately this happens automatically. Since rmap supports
> multiple mappings for the same GFN for PTE shadowing based paging, by
> adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> operations will be applied to both aliases.
> 
> In the case of the rmap being removed in the future, the needed
> information could be recovered by iterating over the stolen bits and
> walking the TDP page tables.
> 
> For TLB flushes that are address based, make sure to flush both aliases
> in the stolen bits case.
> 
> Only support stolen bits in 64 bit guest paging modes (long, PAE).
> Features that use this infrastructure should restrict the stolen bits to
> exclude the other paging modes. Don't support stolen bits for shadow EPT.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

Looks good, but the commit message is obsolete.

Paolo

>   arch/x86/kvm/mmu.h              | 26 ++++++++++
>   arch/x86/kvm/mmu/mmu.c          | 86 ++++++++++++++++++++++-----------
>   arch/x86/kvm/mmu/mmu_internal.h |  1 +
>   arch/x86/kvm/mmu/paging_tmpl.h  | 25 ++++++----
>   4 files changed, 101 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 88d0ed5225a4..69b82857acdb 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -232,4 +232,30 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>   int kvm_mmu_post_init_vm(struct kvm *kvm);
>   void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
>   
> +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> +{
> +	/* Currently there are no stolen bits in KVM */
> +	return 0;
> +}
> +
> +static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_gfn_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> +{
> +	return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
> +}
> +
> +static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_gpa_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> +	return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
> +}
> +
>   #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0dc4bf34ce9c..990ee645b8a2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -188,27 +188,37 @@ static inline bool kvm_available_flush_tlb_with_range(void)
>   	return kvm_x86_ops.tlb_remote_flush_with_range;
>   }
>   
> -static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> -		struct kvm_tlb_range *range)
> -{
> -	int ret = -ENOTSUPP;
> -
> -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> -		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> -
> -	if (ret)
> -		kvm_flush_remote_tlbs(kvm);
> -}
> -
>   void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
>   		u64 start_gfn, u64 pages)
>   {
>   	struct kvm_tlb_range range;
> +	u64 gfn_stolen_mask;
> +
> +	if (!kvm_available_flush_tlb_with_range())
> +		goto generic_flush;
> +
> +	/*
> +	 * Fall back to the big hammer flush if there is more than one
> +	 * GPA alias that needs to be flushed.
> +	 */
> +	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> +	if (hweight64(gfn_stolen_mask) > 1)
> +		goto generic_flush;
>   
>   	range.start_gfn = start_gfn;
>   	range.pages = pages;
> +	if (static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range))
> +		goto generic_flush;
> +
> +	if (!gfn_stolen_mask)
> +		return;
>   
> -	kvm_flush_remote_tlbs_with_range(kvm, &range);
> +	range.start_gfn |= gfn_stolen_mask;
> +	static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range);
> +	return;
> +
> +generic_flush:
> +	kvm_flush_remote_tlbs(kvm);
>   }
>   
>   bool is_nx_huge_page_enabled(void)
> @@ -1949,14 +1959,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
>   	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>   }
>   
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> -					     gfn_t gfn,
> -					     gva_t gaddr,
> -					     unsigned level,
> -					     int direct,
> -					     unsigned int access)
> +static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> +					       gfn_t gfn,
> +					       gfn_t gfn_stolen_bits,
> +					       gva_t gaddr,
> +					       unsigned int level,
> +					       int direct,
> +					       unsigned int access)
>   {
>   	bool direct_mmu = vcpu->arch.mmu->direct_map;
> +	gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
>   	union kvm_mmu_page_role role;
>   	struct hlist_head *sp_list;
>   	unsigned quadrant;
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   		role.quadrant = quadrant;
>   	}
>   
> -	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> +	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
>   	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> -		if (sp->gfn != gfn) {
> +		if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
>   			collisions++;
>   			continue;
>   		}
> @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   	sp = kvm_mmu_alloc_page(vcpu, direct);
>   
>   	sp->gfn = gfn;
> +	sp->gfn_stolen_bits = gfn_stolen_bits;
>   	sp->role = role;
>   	hlist_add_head(&sp->hash_link, sp_list);
>   	if (!direct) {
> @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   	return sp;
>   }
>   
> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> +					     gva_t gaddr, unsigned int level,
> +					     int direct, unsigned int access)
> +{
> +	return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
> +}
> +
>   static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
>   					struct kvm_vcpu *vcpu, hpa_t root,
>   					u64 addr)
> @@ -2637,7 +2657,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
>   
>   	gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
>   	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
> -	if (!slot)
> +
> +	/* Don't map private memslots for stolen bits */
> +	if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
>   		return -1;
>   
>   	ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
> @@ -2827,7 +2849,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   	struct kvm_shadow_walk_iterator it;
>   	struct kvm_mmu_page *sp;
>   	int level, req_level, ret;
> -	gfn_t gfn = gpa >> PAGE_SHIFT;
> +	gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
> +	gfn_t gfn = (gpa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> +	gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
>   	gfn_t base_gfn = gfn;
>   
>   	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> @@ -2852,8 +2876,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   
>   		drop_large_spte(vcpu, it.sptep);
>   		if (!is_shadow_present_pte(*it.sptep)) {
> -			sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
> -					      it.level - 1, true, ACC_ALL);
> +			sp = __kvm_mmu_get_page(vcpu, base_gfn,
> +						gfn_stolen_bits, it.addr,
> +						it.level - 1, true, ACC_ALL);
>   
>   			link_shadow_page(vcpu, it.sptep, sp);
>   			if (is_tdp && huge_page_disallowed &&
> @@ -3689,6 +3714,13 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>   	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
>   		return true;
>   
> +	/* Don't expose aliases for no slot GFNs or private memslots */
> +	if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
> +	    !kvm_is_visible_memslot(slot)) {
> +		*pfn = KVM_PFN_NOSLOT;
> +		return false;
> +	}
> +
>   	/* Don't expose private memslots to L2. */
>   	if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
>   		*pfn = KVM_PFN_NOSLOT;
> @@ -3723,7 +3755,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   	bool write = error_code & PFERR_WRITE_MASK;
>   	bool map_writable;
>   
> -	gfn_t gfn = gpa >> PAGE_SHIFT;
> +	gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
>   	unsigned long mmu_seq;
>   	kvm_pfn_t pfn;
>   	hva_t hva;
> @@ -3833,7 +3865,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>   	     max_level > PG_LEVEL_4K;
>   	     max_level--) {
>   		int page_num = KVM_PAGES_PER_HPAGE(max_level);
> -		gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
> +		gfn_t base = vcpu_gpa_to_gfn_unalias(vcpu, gpa) & ~(page_num - 1);
>   
>   		if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
>   			break;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index d64ccb417c60..c896ec9f3159 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -46,6 +46,7 @@ struct kvm_mmu_page {
>   	 */
>   	union kvm_mmu_page_role role;
>   	gfn_t gfn;
> +	gfn_t gfn_stolen_bits;
>   
>   	u64 *spt;
>   	/* hold the gfn of each spte inside spt */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 823a5919f9fa..439dc141391b 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -25,7 +25,8 @@
>   	#define guest_walker guest_walker64
>   	#define FNAME(name) paging##64_##name
>   	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
> +					     PT64_LVL_ADDR_MASK(lvl))
>   	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>   	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>   	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -44,7 +45,7 @@
>   	#define guest_walker guest_walker32
>   	#define FNAME(name) paging##32_##name
>   	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
>   	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
>   	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
>   	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> @@ -58,7 +59,7 @@
>   	#define guest_walker guest_walkerEPT
>   	#define FNAME(name) ept_##name
>   	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
>   	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>   	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>   	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -75,7 +76,7 @@
>   #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
>   
>   #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
> -#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
> +#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
>   
>   /*
>    * The guest_walker structure emulates the behavior of the hardware page
> @@ -96,9 +97,9 @@ struct guest_walker {
>   	struct x86_exception fault;
>   };
>   
> -static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> +static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
>   {
> -	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> +	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
>   }
>   
>   static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
> @@ -366,7 +367,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>   		--walker->level;
>   
>   		index = PT_INDEX(addr, walker->level);
> -		table_gfn = gpte_to_gfn(pte);
> +		table_gfn = gpte_to_gfn(vcpu, pte);
>   		offset    = index * sizeof(pt_element_t);
>   		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
>   
> @@ -432,7 +433,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>   	if (unlikely(errcode))
>   		goto error;
>   
> -	gfn = gpte_to_gfn_lvl(pte, walker->level);
> +	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
>   	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
>   
>   	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
> @@ -537,12 +538,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>   	gfn_t gfn;
>   	kvm_pfn_t pfn;
>   
> +	WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
> +
>   	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
>   		return false;
>   
>   	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
>   
> -	gfn = gpte_to_gfn(gpte);
> +	gfn = gpte_to_gfn(vcpu, gpte);
>   	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
>   	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
>   	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
> @@ -652,6 +655,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
>   
>   	direct_access = gw->pte_access;
>   
> +	WARN_ON(addr & vcpu_gpa_stolen_mask(vcpu));
> +
>   	top_level = vcpu->arch.mmu->root_level;
>   	if (top_level == PT32E_ROOT_LEVEL)
>   		top_level = PT32_ROOT_LEVEL;
> @@ -1067,7 +1072,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>   			continue;
>   		}
>   
> -		gfn = gpte_to_gfn(gpte);
> +		gfn = gpte_to_gfn(vcpu, gpte);
>   		pte_access = sp->role.access;
>   		pte_access &= FNAME(gpte_access)(gpte);
>   		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
>
Kai Huang Aug. 5, 2021, 11:44 a.m. UTC | #2
On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@intel.com wrote:
> From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> perspective) to a single GPA (from a memslot perspective). GPA alising
> will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> execute-only permission bit to the guest. To keep the implementation
> simple (relatively speaking), GPA aliasing is only supported via TDP.
> 
> Today KVM assumes two things that are broken by GPA aliasing.
>   1. GPAs coming from hardware can be simply shifted to get the GFNs.
>   2. GPA bits 51:MAXPHYADDR are reserved to zero.
> 
> With GPA aliasing, translating a GPA to GFN requires masking off the
> repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
> 
> To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> that is, bits stolen from the GPA to act as new virtualized attribute
> bits. A bit in the mask will cause the MMU code to create aliases of the
> GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> fault.
> 
> To handle case (1) from above, retain any stolen bits when passing a GPA
> in KVM's MMU code, but strip them when converting to a GFN so that the
> GFN contains only the "real" GFN, i.e. never has repurposed bits set.
> 
> GFNs (without stolen bits) continue to be used to:
> 	-Specify physical memory by userspace via memslots
> 	-Map GPAs to TDP PTEs via RMAP
> 	-Specify dirty tracking and write protection
> 	-Look up MTRR types
> 	-Inject async page faults
> 
> Since there are now multiple aliases for the same aliased GPA, when
> userspace memory backing the memslots is paged out, both aliases need to be
> modified. Fortunately this happens automatically. Since rmap supports
> multiple mappings for the same GFN for PTE shadowing based paging, by
> adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> operations will be applied to both aliases.
> 
> In the case of the rmap being removed in the future, the needed
> information could be recovered by iterating over the stolen bits and
> walking the TDP page tables.
> 
> For TLB flushes that are address based, make sure to flush both aliases
> in the stolen bits case.
> 
> Only support stolen bits in 64 bit guest paging modes (long, PAE).
> Features that use this infrastructure should restrict the stolen bits to
> exclude the other paging modes. Don't support stolen bits for shadow EPT.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu.h              | 26 ++++++++++
>  arch/x86/kvm/mmu/mmu.c          | 86 ++++++++++++++++++++++-----------
>  arch/x86/kvm/mmu/mmu_internal.h |  1 +
>  arch/x86/kvm/mmu/paging_tmpl.h  | 25 ++++++----
>  4 files changed, 101 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 88d0ed5225a4..69b82857acdb 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -232,4 +232,30 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>  int kvm_mmu_post_init_vm(struct kvm *kvm);
>  void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
>  
> +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> +{
> +	/* Currently there are no stolen bits in KVM */
> +	return 0;
> +}
> +
> +static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_gfn_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> +{
> +	return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
> +}
> +
> +static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_gpa_stolen_mask(vcpu->kvm);
> +}
> +
> +static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> +	return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
> +}
> +
>  #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0dc4bf34ce9c..990ee645b8a2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -188,27 +188,37 @@ static inline bool kvm_available_flush_tlb_with_range(void)
>  	return kvm_x86_ops.tlb_remote_flush_with_range;
>  }
>  
> -static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> -		struct kvm_tlb_range *range)
> -{
> -	int ret = -ENOTSUPP;
> -
> -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> -		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> -
> -	if (ret)
> -		kvm_flush_remote_tlbs(kvm);
> -}
> -
>  void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
>  		u64 start_gfn, u64 pages)
>  {
>  	struct kvm_tlb_range range;
> +	u64 gfn_stolen_mask;
> +
> +	if (!kvm_available_flush_tlb_with_range())
> +		goto generic_flush;
> +
> +	/*
> +	 * Fall back to the big hammer flush if there is more than one
> +	 * GPA alias that needs to be flushed.
> +	 */
> +	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> +	if (hweight64(gfn_stolen_mask) > 1)
> +		goto generic_flush;
>  
>  	range.start_gfn = start_gfn;
>  	range.pages = pages;
> +	if (static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range))
> +		goto generic_flush;
> +
> +	if (!gfn_stolen_mask)
> +		return;
>  
> -	kvm_flush_remote_tlbs_with_range(kvm, &range);
> +	range.start_gfn |= gfn_stolen_mask;
> +	static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range);
> +	return;
> +
> +generic_flush:
> +	kvm_flush_remote_tlbs(kvm);
>  }
>  
>  bool is_nx_huge_page_enabled(void)
> @@ -1949,14 +1959,16 @@ static void clear_sp_write_flooding_count(u64 *spte)
>  	__clear_sp_write_flooding_count(sptep_to_sp(spte));
>  }
>  
> -static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> -					     gfn_t gfn,
> -					     gva_t gaddr,
> -					     unsigned level,
> -					     int direct,
> -					     unsigned int access)
> +static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> +					       gfn_t gfn,
> +					       gfn_t gfn_stolen_bits,
> +					       gva_t gaddr,
> +					       unsigned int level,
> +					       int direct,
> +					       unsigned int access)
>  {
>  	bool direct_mmu = vcpu->arch.mmu->direct_map;
> +	gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
>  	union kvm_mmu_page_role role;
>  	struct hlist_head *sp_list;
>  	unsigned quadrant;
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		role.quadrant = quadrant;
>  	}
>  
> -	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> +	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> -		if (sp->gfn != gfn) {
> +		if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
>  			collisions++;
>  			continue;
>  		}
> @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  	sp = kvm_mmu_alloc_page(vcpu, direct);
>  
>  	sp->gfn = gfn;
> +	sp->gfn_stolen_bits = gfn_stolen_bits;
>  	sp->role = role;
>  	hlist_add_head(&sp->hash_link, sp_list);
>  	if (!direct) {
> @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  	return sp;
>  }


Sorry for replying to an old thread, but to me it looks weird to have
gfn_stolen_bits in 'struct kvm_mmu_page'.  If I understand correctly, the above
code basically means that a GFN with different stolen bits will have different
'struct kvm_mmu_page's, but in the context of this patch, mappings with
different stolen bits still use the same root, which means gfn_stolen_bits
doesn't make a lot of sense, at least for the root page table.

Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
the context of TDX, since TDX requires two separate roots for private and
shared mappings.

So given we cannot tell whether the same root or different roots should be
used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
'struct kvm_mmu_page' and use it to determine whether to allocate a new table
for the same GFN, but should instead use a new role bit (i.e. role.private) to
make that determination.

And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save some
memory.
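
For illustration, a rough sketch of this alternative (role.private and
kvm_gpa_private_mask() are hypothetical names, not in this patch): the alias is
derived from the faulting GPA and folded into the page role when it is built,
so the existing role comparison in kvm_mmu_get_page() keeps aliases of the same
GFN on separate shadow pages without a gfn_stolen_bits field:

	/* Sketch only, built the same way kvm_mmu_get_page() builds the role. */
	role = vcpu->arch.mmu->mmu_role.base;
	role.level = level;
	role.direct = direct;
	role.access = access;
	role.private = !!(gpa & kvm_gpa_private_mask(vcpu->kvm));

	/*
	 * The existing "sp->role.word != role.word" check then rejects an sp
	 * created for the other alias, so a new table is allocated for it.
	 */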

>  
> +static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> +					     gva_t gaddr, unsigned int level,
> +					     int direct, unsigned int access)
> +{
> +	return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
> +}
> +
>  static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
>  					struct kvm_vcpu *vcpu, hpa_t root,
>  					u64 addr)
> @@ -2637,7 +2657,9 @@ static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
>  
>  	gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
>  	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
> -	if (!slot)
> +
> +	/* Don't map private memslots for stolen bits */
> +	if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
>  		return -1;
>  
>  	ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
> @@ -2827,7 +2849,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	struct kvm_shadow_walk_iterator it;
>  	struct kvm_mmu_page *sp;
>  	int level, req_level, ret;
> -	gfn_t gfn = gpa >> PAGE_SHIFT;
> +	gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
> +	gfn_t gfn = (gpa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> +	gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
>  	gfn_t base_gfn = gfn;
>  
>  	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> @@ -2852,8 +2876,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  
>  		drop_large_spte(vcpu, it.sptep);
>  		if (!is_shadow_present_pte(*it.sptep)) {
> -			sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
> -					      it.level - 1, true, ACC_ALL);
> +			sp = __kvm_mmu_get_page(vcpu, base_gfn,
> +						gfn_stolen_bits, it.addr,
> +						it.level - 1, true, ACC_ALL);
>  
>  			link_shadow_page(vcpu, it.sptep, sp);
>  			if (is_tdp && huge_page_disallowed &&
> @@ -3689,6 +3714,13 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>  	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
>  		return true;
>  
> +	/* Don't expose aliases for no slot GFNs or private memslots */
> +	if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
> +	    !kvm_is_visible_memslot(slot)) {
> +		*pfn = KVM_PFN_NOSLOT;
> +		return false;
> +	}
> +
>  	/* Don't expose private memslots to L2. */
>  	if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
>  		*pfn = KVM_PFN_NOSLOT;
> @@ -3723,7 +3755,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	bool write = error_code & PFERR_WRITE_MASK;
>  	bool map_writable;
>  
> -	gfn_t gfn = gpa >> PAGE_SHIFT;
> +	gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
>  	unsigned long mmu_seq;
>  	kvm_pfn_t pfn;
>  	hva_t hva;
> @@ -3833,7 +3865,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	     max_level > PG_LEVEL_4K;
>  	     max_level--) {
>  		int page_num = KVM_PAGES_PER_HPAGE(max_level);
> -		gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
> +		gfn_t base = vcpu_gpa_to_gfn_unalias(vcpu, gpa) & ~(page_num - 1);
>  
>  		if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
>  			break;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index d64ccb417c60..c896ec9f3159 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -46,6 +46,7 @@ struct kvm_mmu_page {
>  	 */
>  	union kvm_mmu_page_role role;
>  	gfn_t gfn;
> +	gfn_t gfn_stolen_bits;
>  
>  	u64 *spt;
>  	/* hold the gfn of each spte inside spt */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 823a5919f9fa..439dc141391b 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -25,7 +25,8 @@
>  	#define guest_walker guest_walker64
>  	#define FNAME(name) paging##64_##name
>  	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
> +					     PT64_LVL_ADDR_MASK(lvl))
>  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -44,7 +45,7 @@
>  	#define guest_walker guest_walker32
>  	#define FNAME(name) paging##32_##name
>  	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
>  	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
>  	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> @@ -58,7 +59,7 @@
>  	#define guest_walker guest_walkerEPT
>  	#define FNAME(name) ept_##name
>  	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
>  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -75,7 +76,7 @@
>  #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
>  
>  #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
> -#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
> +#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
>  
>  /*
>   * The guest_walker structure emulates the behavior of the hardware page
> @@ -96,9 +97,9 @@ struct guest_walker {
>  	struct x86_exception fault;
>  };
>  
> -static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> +static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
>  {
> -	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> +	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
>  }
>  
>  static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
> @@ -366,7 +367,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  		--walker->level;
>  
>  		index = PT_INDEX(addr, walker->level);
> -		table_gfn = gpte_to_gfn(pte);
> +		table_gfn = gpte_to_gfn(vcpu, pte);
>  		offset    = index * sizeof(pt_element_t);
>  		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
>  
> @@ -432,7 +433,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	if (unlikely(errcode))
>  		goto error;
>  
> -	gfn = gpte_to_gfn_lvl(pte, walker->level);
> +	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
>  	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
>  
>  	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
> @@ -537,12 +538,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	gfn_t gfn;
>  	kvm_pfn_t pfn;
>  
> +	WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
> +
>  	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
>  		return false;
>  
>  	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
>  
> -	gfn = gpte_to_gfn(gpte);
> +	gfn = gpte_to_gfn(vcpu, gpte);
>  	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
>  	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
>  	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
> @@ -652,6 +655,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
>  
>  	direct_access = gw->pte_access;
>  
> +	WARN_ON(addr & vcpu_gpa_stolen_mask(vcpu));
> +
>  	top_level = vcpu->arch.mmu->root_level;
>  	if (top_level == PT32E_ROOT_LEVEL)
>  		top_level = PT32_ROOT_LEVEL;
> @@ -1067,7 +1072,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  			continue;
>  		}
>  
> -		gfn = gpte_to_gfn(gpte);
> +		gfn = gpte_to_gfn(vcpu, gpte);
>  		pte_access = sp->role.access;
>  		pte_access &= FNAME(gpte_access)(gpte);
>  		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> -- 
> 2.25.1
>
Sean Christopherson Aug. 5, 2021, 4:06 p.m. UTC | #3
On Thu, Aug 05, 2021, Kai Huang wrote:
> On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@intel.com wrote:
> > From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >  	sp = kvm_mmu_alloc_page(vcpu, direct);
> >  
> >  	sp->gfn = gfn;
> > +	sp->gfn_stolen_bits = gfn_stolen_bits;
> >  	sp->role = role;
> >  	hlist_add_head(&sp->hash_link, sp_list);
> >  	if (!direct) {
> > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >  	return sp;
> >  }
> 
> 
> Sorry for replying old thread,

Ha, one month isn't old, it's barely even mature.

> but to me it looks weird to have gfn_stolen_bits
> in 'struct kvm_mmu_page'.  If I understand correctly, above code basically
> means that GFN with different stolen bit will have different 'struct
> kvm_mmu_page', but in the context of this patch, mappings with different
> stolen bits still use the same root,

You're conflating "mapping" with "PTE".  The GFN is a per-PTE value.  Yes, there
is a final GFN that is representative of the mapping, but more directly the final
GFN is associated with the leaf PTE.

TDX effectively adds the restriction that all PTEs used for a mapping must have
the same shared/private status, so mapping and PTE are somewhat interchangeable
when talking about stolen bits (the shared bit), but in the context of this patch,
the stolen bits are a property of the PTE.

Back to your statement, it's incorrect.  PTEs (effectively mappings in TDX) with
different stolen bits will _not_ use the same root.  kvm_mmu_get_page() includes
the stolen bits in both the hash lookup and in the comparison, i.e. restores the
stolen bits when looking for an existing shadow page at the target GFN.

@@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
                role.quadrant = quadrant;
        }

-       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
        for_each_valid_sp(vcpu->kvm, sp, sp_list) {
-               if (sp->gfn != gfn) {
+               if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
                        collisions++;
                        continue;
                }

> which means gfn_stolen_bits doesn't make a lot of sense at least for root
> page table. 

It does make sense, even without a follow-up patch.  In Rick's original series,
stealing a bit for execute-only guest memory, there was only a single root.  And
except for TDX, there can only ever be a single root because the shared EPTP isn't
usable, i.e. there's only the regular/private EPTP.

> Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> the context of TDX, since TDX requires two separate roots for private and
> shared mappings.

> So given we cannot tell whether the same root, or different roots should be
> used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> for the same GFN, but should use a new role (i.e role.private) to determine.

A new role would work, too, but it has the disadvantage of not automagically
working for all uses of stolen bits, e.g. XO support would have to add another
role bit.

> And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save some
> memory.

But I do like saving memory...  One potentially bad idea would be to unionize
gfn and stolen bits by shifting the stolen bits after they're extracted from the
gpa, e.g.

	union {
		gfn_t gfn_and_stolen;
		struct {
			gfn_t gfn:52;
			gfn_t stolen:12;
		}
	};

the downsides being that accessing just the gfn would require an additional masking
operation, and the stolen bits wouldn't align with reality.
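
Concretely, reading just the gfn from such a packed field would look something
like the below (a sketch, assuming a 52-bit GFN as in the union above; whether
done through the bitfield or an explicit mask, every access pays the masking):

	/* Sketch only: recover the "real" gfn from the packed field. */
	static inline gfn_t sp_gfn(const struct kvm_mmu_page *sp)
	{
		return sp->gfn_and_stolen & GENMASK_ULL(51, 0);
	}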
Rick Edgecombe Aug. 5, 2021, 5:07 p.m. UTC | #4
On Thu, 2021-08-05 at 16:06 +0000, Sean Christopherson wrote:
> On Thu, Aug 05, 2021, Kai Huang wrote:
> > On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@intel.com wrote:
> > > From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page
> > > *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  	sp = kvm_mmu_alloc_page(vcpu, direct);
> > >  
> > >  	sp->gfn = gfn;
> > > +	sp->gfn_stolen_bits = gfn_stolen_bits;
> > >  	sp->role = role;
> > >  	hlist_add_head(&sp->hash_link, sp_list);
> > >  	if (!direct) {
> > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page
> > > *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  	return sp;
> > >  }
> > 
> > 
> > Sorry for replying old thread,
> 
> Ha, one month isn't old, it's barely even mature.
> 
> > but to me it looks weird to have gfn_stolen_bits
> > in 'struct kvm_mmu_page'.  If I understand correctly, above code
> > basically
> > means that GFN with different stolen bit will have different
> > 'struct
> > kvm_mmu_page', but in the context of this patch, mappings with
> > different
> > stolen bits still use the same root,
> 
> You're conflating "mapping" with "PTE".  The GFN is a per-PTE
> value.  Yes, there
> is a final GFN that is representative of the mapping, but more
> directly the final
> GFN is associated with the leaf PTE.
> 
> TDX effectively adds the restriction that all PTEs used for a mapping
> must have
> the same shared/private status, so mapping and PTE are somewhat
> interchangeable
> when talking about stolen bits (the shared bit), but in the context
> of this patch,
> the stolen bits are a property of the PTE.
> 
> Back to your statement, it's incorrect.  PTEs (effectively mappings
> in TDX) with
> different stolen bits will _not_ use the same
> root.  kvm_mmu_get_page() includes
> the stolen bits in both the hash lookup and in the comparison, i.e.
> restores the
> stolen bits when looking for an existing shadow page at the target
> GFN.
> 
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page
> *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>                 role.quadrant = quadrant;
>         }
> 
> -       sp_list = &vcpu->kvm-
> >arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> +       sp_list = &vcpu->kvm-
> >arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
>         for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> -               if (sp->gfn != gfn) {
> +               if ((sp->gfn | sp->gfn_stolen_bits) !=
> gfn_and_stolen) {
>                         collisions++;
>                         continue;
>                 }
> 
> > which means gfn_stolen_bits doesn't make a lot of sense at least
> > for root
> > page table. 
> 
> It does make sense, even without a follow-up patch.  In Rick's
> original series,
> stealing a bit for execute-only guest memory, there was only a single
> root.  And
> except for TDX, there can only ever be a single root because the
> shared EPTP isn't
> usable, i.e. there's only the regular/private EPTP.
> 
> > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes
> > sense in
> > the context of TDX, since TDX requires two separate roots for
> > private and
> > shared mappings.
> > So given we cannot tell whether the same root, or different roots
> > should be
> > used for different stolen bits, I think we should not add
> > 'gfn_stolen_bits' to
> > 'struct kvm_mmu_page' and use it to determine whether to allocate a
> > new table
> > for the same GFN, but should use a new role (i.e role.private) to
> > determine.
> 
> A new role would work, too, but it has the disadvantage of not
> automagically
> working for all uses of stolen bits, e.g. XO support would have to
> add another
> role bit.
> 
> > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also
> > save some
> > memory.
> 
> But I do like saving memory...  One potentially bad idea would be to
> unionize
> gfn and stolen bits by shifting the stolen bits after they're
> extracted from the
> gpa, e.g.
> 
> 	union {
> 		gfn_t gfn_and_stolen;
> 		struct {
> 			gfn_t gfn:52;
> 			gfn_t stolen:12;
> 		}
> 	};
> 
> the downsides being that accessing just the gfn would require an
> additional masking
> operation, and the stolen bits wouldn't align with reality.

It definitely seems like the sp could be packed more efficiently.
One other idea is that the stolen bits could just be recovered from the role
bits with a helper, similar to how the page fault error code encoding of the
stolen bits works.

If the stolen bits are not fed into the hash calculation though it
would change the behavior a bit. Not sure if for better or worse. Also
the calculation of hash collisions would need to be aware.

FWIW, I kind of like something like Sean's proposal. It's a bit
convoluted, but there are more unused bits in the gfn than the role.
Also they are a little more related.
Sean Christopherson Aug. 5, 2021, 5:39 p.m. UTC | #5
On Thu, Aug 05, 2021, Edgecombe, Rick P wrote:
> On Thu, 2021-08-05 at 16:06 +0000, Sean Christopherson wrote:
> > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save
> > > some memory.
> > 
> > But I do like saving memory...  One potentially bad idea would be to
> > unionize gfn and stolen bits by shifting the stolen bits after they're
> > extracted from the gpa, e.g.
> > 
> > 	union {
> > 		gfn_t gfn_and_stolen;
> > 		struct {
> > 			gfn_t gfn:52;
> > 			gfn_t stolen:12;
> > 		}
> > 	};
> > 
> > the downsides being that accessing just the gfn would require an additional
> > masking operation, and the stolen bits wouldn't align with reality.
> 
> It definitely seems like the sp could be packed more efficiently.

Yeah, in general it could be optimized.  But for TDP/direct MMUs, we don't care
thaaat much because there are relatively few shadow pages, versus indirect MMUs
with thousands or tens of thousands of shadow pages.  Of course, indirect MMUs
are also the most gluttonous due to the unsync_child_bitmap, gfns, write flooding
count, etc...

If we really want to reduce the memory footprint for the common case (TDP MMU),
the crud that's used only by indirect shadow pages could be shoved into a
different struct by abusing the struct layout and wrapping accesses to the
indirect-only fields with casts/container_of and helpers, e.g.

struct kvm_mmu_indirect_page {
	struct kvm_mmu_page this;

	gfn_t *gfns;
	unsigned int unsync_children;
	DECLARE_BITMAP(unsync_child_bitmap, 512);

#ifdef CONFIG_X86_32
	/*
	 * Used out of the mmu-lock to avoid reading spte values while an
	 * update is in progress; see the comments in __get_spte_lockless().
	 */
	int clear_spte_count;
#endif

	/* Number of writes since the last time traversal visited this page.  */
	atomic_t write_flooding_count;
}
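
Accesses to the indirect-only fields would then go through a trivial
container_of() wrapper, e.g. (sketch only, names made up here):

	static inline struct kvm_mmu_indirect_page *to_indirect_sp(struct kvm_mmu_page *sp)
	{
		WARN_ON_ONCE(sp->role.direct);
		return container_of(sp, struct kvm_mmu_indirect_page, this);
	}

	/* e.g. instead of touching sp->write_flooding_count directly: */
	atomic_inc(&to_indirect_sp(sp)->write_flooding_count);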


> One other idea is the stolen bits could just be recovered from the role
> bits with a helper, like how the page fault error code stolen bits
> encoding version of this works.

As in, a generic "stolen_gfn_bits" in the role instead of a per-feature role bit?
That would avoid the problem of per-feature role bits leading to a pile of
marshalling code, and wouldn't suffer the masking cost when accessing ->gfn,
though I'm not sure that matters much.

> If the stolen bits are not fed into the hash calculation though it
> would change the behavior a bit. Not sure if for better or worse. Also
> the calculation of hash collisions would need to be aware.

The role is already factored into the collision logic.

> FWIW, I kind of like something like Sean's proposal. It's a bit
> convoluted, but there are more unused bits in the gfn than the role.

And tightly bound, i.e. there can't be more than gfn_t gfn+gfn_stolen bits.

> Also they are a little more related.
Rick Edgecombe Aug. 5, 2021, 6:43 p.m. UTC | #6
On Thu, 2021-08-05 at 17:39 +0000, Sean Christopherson wrote:
> On Thu, Aug 05, 2021, Edgecombe, Rick P wrote:
> > On Thu, 2021-08-05 at 16:06 +0000, Sean Christopherson wrote:
> > > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could
> > > > also save
> > > > some memory.
> > > 
> > > But I do like saving memory...  One potentially bad idea would be
> > > to
> > > unionize gfn and stolen bits by shifting the stolen bits after
> > > they're
> > > extracted from the gpa, e.g.
> > > 
> > > 	union {
> > > 		gfn_t gfn_and_stolen;
> > > 		struct {
> > > 			gfn_t gfn:52;
> > > 			gfn_t stolen:12;
> > > 		}
> > > 	};
> > > 
> > > the downsides being that accessing just the gfn would require an
> > > additional
> > > masking operation, and the stolen bits wouldn't align with
> > > reality.
> > 
> > It definitely seems like the sp could be packed more efficiently.
> 
> Yeah, in general it could be optimized.  But for TDP/direct MMUs, we
> don't care
> thaaat much because there are relatively few shadow pages, versus
> indirect MMUs
> with thousands or tens of thousands of shadow pages.  Of course,
> indirect MMUs
> are also the most gluttonous due to the unsync_child_bitmap, gfns,
> write flooding
> count, etc...
> 
> If we really want to reduce the memory footprint for the common case
> (TDP MMU),
> the crud that's used only by indirect shadow pages could be shoved
> into a
> different struct by abusing the struct layout and and wrapping
> accesses to the
> indirect-only fields with casts/container_of and helpers, e.g.
> 
Wow, didn't realize classic MMU was that relegated already. Mostly an
onlooker here, but does TDX actually need classic MMU support? Nice to
have?

> struct kvm_mmu_indirect_page {
> 	struct kvm_mmu_page this;
> 
> 	gfn_t *gfns;
> 	unsigned int unsync_children;
> 	DECLARE_BITMAP(unsync_child_bitmap, 512);
> 
> #ifdef CONFIG_X86_32
> 	/*
> 	 * Used out of the mmu-lock to avoid reading spte values while
> an
> 	 * update is in progress; see the comments in
> __get_spte_lockless().
> 	 */
> 	int clear_spte_count;
> #endif
> 
> 	/* Number of writes since the last time traversal visited this
> page.  */
> 	atomic_t write_flooding_count;
> }
> 
> 
> > One other idea is the stolen bits could just be recovered from the
> > role
> > bits with a helper, like how the page fault error code stolen bits
> > encoding version of this works.
> 
> As in, a generic "stolen_gfn_bits" in the role instead of a per-
> feature role bit?
> That would avoid the problem of per-feature role bits leading to a
> pile of
> marshalling code, and wouldn't suffer the masking cost when accessing
> ->gfn,
> though I'm not sure that matters much.
Well I was thinking multiple types of aliases, like the pf err code
stuff works, like this:

gfn_t stolen_bits(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	gfn_t stolen = 0;

	if (sp->role.shared)
		stolen |= kvm->arch.gfn_shared_mask;
	if (sp->role.other_alias)
		stolen |= kvm->arch.gfn_other_mask;

	return stolen;
}

But yea, there really only needs to be one. Still bit shifting seems
better.

> 
> > If the stolen bits are not fed into the hash calculation though it
> > would change the behavior a bit. Not sure if for better or worse.
> > Also
> > the calculation of hash collisions would need to be aware.
> 
> The role is already factored into the collision logic.
I mean how aliases of the same gfn don't necessarily collide and the
collisions counter is only incremented if the gfn/stolen matches, but
not if the role is different.

> 
> > FWIW, I kind of like something like Sean's proposal. It's a bit
> > convoluted, but there are more unused bits in the gfn than the
> > role.
> 
> And tightly bound, i.e. there can't be more than gfn_t gfn+gfn_stolen
> bits.
> 
> > Also they are a little more related.
> 
>
Sean Christopherson Aug. 5, 2021, 6:58 p.m. UTC | #7
On Thu, Aug 05, 2021, Edgecombe, Rick P wrote:
> On Thu, 2021-08-05 at 17:39 +0000, Sean Christopherson wrote:
> > If we really want to reduce the memory footprint for the common case (TDP
> > MMU), the crud that's used only by indirect shadow pages could be shoved
> > into a different struct by abusing the struct layout and and wrapping
> > accesses to the indirect-only fields with casts/container_of and helpers,
> > e.g.
> > 
> Wow, didn't realize classic MMU was that relegated already. Mostly an
> onlooker here, but does TDX actually need classic MMU support? Nice to
> have?

Gah, bad verbiage on my part.  I didn't mean _the_ TDP MMU, I meant "MMU that
uses TDP".  The "TDP MMU" is being enabled by default in upstream; whether or not
TDX needs to support the classic/legacy MMU is an unanswered question.  From a
maintenance perspective, I'd love to say no, but from a "what does Google actually
want to use for TDX" perspective, I don't have a definitive answer yet :-/

> > The role is already factored into the collision logic.
>
> I mean how aliases of the same gfn don't necessarily collide and the
> collisions counter is only incremented if the gfn/stolen matches, but
> not if the role is different.

Ah.  I still think it would Just Work.  The collisions tracking is purely for
stats, presumably used in the past to see if the hash size was effective.  If
a hash lookup collides then it should be accounted; _why_ it collided doesn't
factor into the stats.

Your comments about the hash behavior being different still stand, though for
TDX they wouldn't matter in practice since KVM is hosed if it has both shared and
private versions of a shadow page.
Kai Huang Aug. 5, 2021, 9:59 p.m. UTC | #8
On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> On Thu, Aug 05, 2021, Kai Huang wrote:
> > On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@intel.com wrote:
> > > From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  	sp = kvm_mmu_alloc_page(vcpu, direct);
> > >  
> > >  	sp->gfn = gfn;
> > > +	sp->gfn_stolen_bits = gfn_stolen_bits;
> > >  	sp->role = role;
> > >  	hlist_add_head(&sp->hash_link, sp_list);
> > >  	if (!direct) {
> > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >  	return sp;
> > >  }
> > 
> > 
> > Sorry for replying old thread,
> 
> Ha, one month isn't old, it's barely even mature.
> 
> > but to me it looks weird to have gfn_stolen_bits
> > in 'struct kvm_mmu_page'.  If I understand correctly, above code basically
> > means that GFN with different stolen bit will have different 'struct
> > kvm_mmu_page', but in the context of this patch, mappings with different
> > stolen bits still use the same root,
> 
> You're conflating "mapping" with "PTE".  The GFN is a per-PTE value.  Yes, there
> is a final GFN that is representative of the mapping, but more directly the final
> GFN is associated with the leaf PTE.
> 
> TDX effectively adds the restriction that all PTEs used for a mapping must have
> the same shared/private status, so mapping and PTE are somewhat interchangeable
> when talking about stolen bits (the shared bit), but in the context of this patch,
> the stolen bits are a property of the PTE.

Yes, it is a property of the PTE, which is why I think it's weird to have
stolen bits in 'struct kvm_mmu_page'. Shouldn't stolen bits in 'struct
kvm_mmu_page' imply that all PTEs (whether leaf or not) share the same
stolen bit?

> 
> Back to your statement, it's incorrect.  PTEs (effectively mappings in TDX) with
> different stolen bits will _not_ use the same root.  kvm_mmu_get_page() includes
> the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> stolen bits when looking for an existing shadow page at the target GFN.
> 
> @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>                 role.quadrant = quadrant;
>         }
> 
> -       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> +       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
>         for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> -               if (sp->gfn != gfn) {
> +               if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
>                         collisions++;
>                         continue;
>                 }
> 

This only works for non-root table, but there's only one single
vcpu->arch.mmu->root_hpa, we don't have an array to have one root for each
stolen bit, i.e. do a loop in mmu_alloc_direct_roots(), so effectively all
stolen bits share one single root.

> > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > page table. 
> 
> It does make sense, even without a follow-up patch.  In Rick's original series,
> stealing a bit for execute-only guest memory, there was only a single root.  And
> except for TDX, there can only ever be a single root because the shared EPTP isn't
> usable, i.e. there's only the regular/private EPTP.
> 
> > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > the context of TDX, since TDX requires two separate roots for private and
> > shared mappings.
> 
> > So given we cannot tell whether the same root, or different roots should be
> > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > for the same GFN, but should use a new role (i.e role.private) to determine.
> 
> A new role would work, too, but it has the disadvantage of not automagically
> working for all uses of stolen bits, e.g. XO support would have to add another
> role bit.

For each use of a particular stolen bit, a new role bit can be defined.  For
instance, in __direct_map(), if you see the stolen bit is the TDX shared bit,
you don't set role.private (otherwise set role.private).  For XO, if you see
the stolen bit is the XO bit, you set role.xo.

We already have the 'gfn_stolen_mask' info in the vcpu, so we just need to make
sure all code paths can find the actual stolen bit based on sp->role and the
vcpu (I haven't gone through all of them though; I assume the annoying part is
the rmap).
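
A sketch of what that could look like on the fault path (role.private, role.xo
and the mask helpers below are hypothetical names, used only for illustration):

	/*
	 * Sketch only: in __direct_map(), translate the stolen GPA bits into
	 * role bits instead of carrying sp->gfn_stolen_bits.
	 */
	if (kvm_gpa_shared_mask(vcpu->kvm))		/* TDX guest */
		role.private = !(gpa & kvm_gpa_shared_mask(vcpu->kvm));
	role.xo = !!(gpa & kvm_gpa_xo_mask(vcpu->kvm));	/* hypothetical XO bit */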

> 
> > And removing 'gfn_stolen_bits' in 'struct kvm_mmu_page' could also save some
> > memory.
> 
> But I do like saving memory...  One potentially bad idea would be to unionize
> gfn and stolen bits by shifting the stolen bits after they're extracted from the
> gpa, e.g.
> 
> 	union {
> 		gfn_t gfn_and_stolen;
> 		struct {
> 			gfn_t gfn:52;
> 			gfn_t stolen:12;
> 		}
> 	};
> 
> the downsides being that accessing just the gfn would require an additional masking
> operation, and the stolen bits wouldn't align with reality.
Sean Christopherson Aug. 6, 2021, 7:02 p.m. UTC | #9
On Fri, Aug 06, 2021, Kai Huang wrote:
> On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@intel.com wrote:
> > > > From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >  	sp = kvm_mmu_alloc_page(vcpu, direct);
> > > >  
> > > >  	sp->gfn = gfn;
> > > > +	sp->gfn_stolen_bits = gfn_stolen_bits;
> > > >  	sp->role = role;
> > > >  	hlist_add_head(&sp->hash_link, sp_list);
> > > >  	if (!direct) {
> > > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >  	return sp;
> > > >  }
> > > 
> > > 
> > > Sorry for replying old thread,
> > 
> > Ha, one month isn't old, it's barely even mature.
> > 
> > > but to me it looks weird to have gfn_stolen_bits
> > > in 'struct kvm_mmu_page'.  If I understand correctly, above code basically
> > > means that GFN with different stolen bit will have different 'struct
> > > kvm_mmu_page', but in the context of this patch, mappings with different
> > > stolen bits still use the same root,
> > 
> > You're conflating "mapping" with "PTE".  The GFN is a per-PTE value.  Yes, there
> > is a final GFN that is representative of the mapping, but more directly the final
> > GFN is associated with the leaf PTE.
> > 
> > TDX effectively adds the restriction that all PTEs used for a mapping must have
> > the same shared/private status, so mapping and PTE are somewhat interchangeable
> > when talking about stolen bits (the shared bit), but in the context of this patch,
> > the stolen bits are a property of the PTE.
> 
> Yes, it is a property of the PTE; this is the reason that I think it's weird
> to have stolen bits in 'struct kvm_mmu_page'.  Shouldn't stolen bits in
> 'struct kvm_mmu_page' imply that all PTEs (whether leaf or not) share the
> same stolen bit?

No, the stolen bits are the property of the shadow page.  I'm using "PTE" above
to mean "PTE for this shadow page", not PTEs within the shadow page, if that makes
sense.

> > Back to your statement, it's incorrect.  PTEs (effectively mappings in TDX) with
> > different stolen bits will _not_ use the same root.  kvm_mmu_get_page() includes
> > the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> > stolen bits when looking for an existing shadow page at the target GFN.
> > 
> > @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >                 role.quadrant = quadrant;
> >         }
> > 
> > -       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > +       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> >         for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > -               if (sp->gfn != gfn) {
> > +               if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> >                         collisions++;
> >                         continue;
> >                 }
> > 
> 
> This only works for non-root tables.  There's only a single
> vcpu->arch.mmu->root_hpa; we don't have an array with one root for each
> stolen bit (i.e. a loop in mmu_alloc_direct_roots()), so effectively all
> stolen bits share one single root.

Yes, and that's absolutely the required behavior for everything except for TDX
with its two EPTPs.  E.g. any other implementation _must_ reject CR3s that set
stolen gfn bits.
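
E.g. a minimal sketch of such a check (hypothetical, not part of this series),
somewhere in the kvm_set_cr3() path, using the vcpu_gpa_stolen_mask() helper
this patch adds:

	/* Stolen GPA bits must be treated as reserved in the guest CR3. */
	if (cr3 & vcpu_gpa_stolen_mask(vcpu))
		return 1;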

> > > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > > page table. 
> > 
> > It does make sense, even without a follow-up patch.  In Rick's original series,
> > stealing a bit for execute-only guest memory, there was only a single root.  And
> > except for TDX, there can only ever be a single root because the shared EPTP isn't
> > usable, i.e. there's only the regular/private EPTP.
> > 
> > > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > > the context of TDX, since TDX requires two separate roots for private and
> > > shared mappings.
> > 
> > > So given we cannot tell whether the same root, or different roots should be
> > > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > > for the same GFN, but should use a new role (i.e role.private) to determine.
> > 
> > A new role would work, too, but it has the disadvantage of not automagically
> > working for all uses of stolen bits, e.g. XO support would have to add another
> > role bit.
> 
> For each purpose of a particular stolen bit, a new role bit can be defined.
> For instance, in __direct_map(), if you see the stolen bit is the TDX shared
> bit, you don't set role.private (otherwise you set role.private).  For XO, if
> you see the stolen bit is XO, you set role.xo.
> 
> We already have the 'gfn_stolen_mask' info in the vcpu, so we just need to
> make sure all code paths can find the actual stolen bit based on sp->role and
> the vcpu (I haven't gone through them all, though; I assume the annoying part
> is rmap).

Yes, and I'm not totally against the idea, but I'm also not 100% sold on it either,
yet...  The idea of a 'private' flag is growing on me.

If we're treating the shared bit as an attribute bit, which we are, then it's
effectively an extension of role.access.  Ditto for XO.

And looking at the code, I think it would be an improvement for TDX, as all of
the is_private_gfn() calls that operate on a shadow page would be simplified and
optimized as they wouldn't have to lookup both gfn_stolen_bits and the vcpu->kvm
mask of the shared bit.

Actually, the more I think about it, the more I like it.  For TDX, there's no
risk of increased hash collisions, as we've already messed up if there's a
shared vs. private collision.

And for XO, if it ever makes its way upstream, I think we should flat out disallow
referencing XO addresses in non-leaf PTEs, i.e. make the XO permission bit reserved
in non-leaf PTEs.  That would avoid any theoretical problems with the guest doing
something stupid by polluting all its upper level PxEs with XO.  Collisions would
be purely limited to the case where the guest is intentionally creating an alternate
mapping, which should be a rare event (or the guest is compromised, which is also
hopefully a rare event).
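
A rough sketch of what the role-bit alternative could look like (hypothetical,
not part of the posted patch; the existing kvm_mmu_page_role bits are elided):

	union kvm_mmu_page_role {
		u32 word;
		struct {
			/* ... existing role bits ... */
			unsigned private:1;	/* TDX-private mapping */
		};
	};

	static inline bool is_private_sp(struct kvm_mmu_page *sp)
	{
		/* No need to also consult the per-VM shared-bit mask. */
		return sp->role.private;
	}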
Kai Huang Aug. 6, 2021, 10 p.m. UTC | #10
On Fri, 6 Aug 2021 19:02:39 +0000 Sean Christopherson wrote:
> On Fri, Aug 06, 2021, Kai Huang wrote:
> > On Thu, 5 Aug 2021 16:06:41 +0000 Sean Christopherson wrote:
> > > On Thu, Aug 05, 2021, Kai Huang wrote:
> > > > On Fri, 2 Jul 2021 15:04:47 -0700 isaku.yamahata@intel.com wrote:
> > > > > From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > @@ -2020,6 +2032,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > >  	sp = kvm_mmu_alloc_page(vcpu, direct);
> > > > >  
> > > > >  	sp->gfn = gfn;
> > > > > +	sp->gfn_stolen_bits = gfn_stolen_bits;
> > > > >  	sp->role = role;
> > > > >  	hlist_add_head(&sp->hash_link, sp_list);
> > > > >  	if (!direct) {
> > > > > @@ -2044,6 +2057,13 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > >  	return sp;
> > > > >  }
> > > > 
> > > > 
> > > > Sorry for replying old thread,
> > > 
> > > Ha, one month isn't old, it's barely even mature.
> > > 
> > > > but to me it looks weird to have gfn_stolen_bits
> > > > in 'struct kvm_mmu_page'.  If I understand correctly, above code basically
> > > > means that GFN with different stolen bit will have different 'struct
> > > > kvm_mmu_page', but in the context of this patch, mappings with different
> > > > stolen bits still use the same root,
> > > 
> > > You're conflating "mapping" with "PTE".  The GFN is a per-PTE value.  Yes, there
> > > is a final GFN that is representative of the mapping, but more directly the final
> > > GFN is associated with the leaf PTE.
> > > 
> > > TDX effectively adds the restriction that all PTEs used for a mapping must have
> > > the same shared/private status, so mapping and PTE are somewhat interchangeable
> > > when talking about stolen bits (the shared bit), but in the context of this patch,
> > > the stolen bits are a property of the PTE.
> > 
> > Yes, it is a property of the PTE; this is the reason that I think it's weird
> > to have stolen bits in 'struct kvm_mmu_page'.  Shouldn't stolen bits in
> > 'struct kvm_mmu_page' imply that all PTEs (whether leaf or not) share the
> > same stolen bit?
> 
> No, the stolen bits are the property of the shadow page.  I'm using "PTE" above
> to mean "PTE for this shadow page", not PTEs within the shadow page, if that makes
> sense.

I see.

> 
> > > Back to your statement, it's incorrect.  PTEs (effectively mappings in TDX) with
> > > different stolen bits will _not_ use the same root.  kvm_mmu_get_page() includes
> > > the stolen bits in both the hash lookup and in the comparison, i.e. restores the
> > > stolen bits when looking for an existing shadow page at the target GFN.
> > > 
> > > @@ -1978,9 +1990,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > >                 role.quadrant = quadrant;
> > >         }
> > > 
> > > -       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > > +       sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
> > >         for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> > > -               if (sp->gfn != gfn) {
> > > +               if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
> > >                         collisions++;
> > >                         continue;
> > >                 }
> > > 
> > 
> > This only works for non-root tables.  There's only a single
> > vcpu->arch.mmu->root_hpa; we don't have an array with one root for each
> > stolen bit (i.e. a loop in mmu_alloc_direct_roots()), so effectively all
> > stolen bits share one single root.
> 
> Yes, and that's absolutely the required behavior for everything except for TDX
> with its two EPTPs.  E.g. any other implementation _must_ reject CR3s that set
> stolen gfn bits.

OK.  I was thinking gfn_stolen_bits in 'struct kvm_mmu_page' for the table
pointed to by CR3 should still make sense.

> 
> > > > which means gfn_stolen_bits doesn't make a lot of sense at least for root
> > > > page table. 
> > > 
> > > It does make sense, even without a follow-up patch.  In Rick's original series,
> > > stealing a bit for execute-only guest memory, there was only a single root.  And
> > > except for TDX, there can only ever be a single root because the shared EPTP isn't
> > > usable, i.e. there's only the regular/private EPTP.
> > > 
> > > > Instead, having gfn_stolen_bits in 'struct kvm_mmu_page' only makes sense in
> > > > the context of TDX, since TDX requires two separate roots for private and
> > > > shared mappings.
> > > 
> > > > So given we cannot tell whether the same root, or different roots should be
> > > > used for different stolen bits, I think we should not add 'gfn_stolen_bits' to
> > > > 'struct kvm_mmu_page' and use it to determine whether to allocate a new table
> > > > for the same GFN, but should use a new role (i.e role.private) to determine.
> > > 
> > > A new role would work, too, but it has the disadvantage of not automagically
> > > working for all uses of stolen bits, e.g. XO support would have to add another
> > > role bit.
> > 
> > For each purpose of a particular stolen bit, a new role bit can be defined.
> > For instance, in __direct_map(), if you see the stolen bit is the TDX shared
> > bit, you don't set role.private (otherwise you set role.private).  For XO, if
> > you see the stolen bit is XO, you set role.xo.
> > 
> > We already have the 'gfn_stolen_mask' info in the vcpu, so we just need to
> > make sure all code paths can find the actual stolen bit based on sp->role and
> > the vcpu (I haven't gone through them all, though; I assume the annoying part
> > is rmap).
> 
> Yes, and I'm not totally against the idea, but I'm also not 100% sold on it either,
> yet...  The idea of a 'private' flag is growing on me.
> 
> If we're treating the shared bit as an attribute bit, which we are, then it's
> effectively an extension of role.access.  Ditto for XO.
> 
> And looking at the code, I think it would be an improvement for TDX, as all of
> the is_private_gfn() calls that operate on a shadow page would be simplified and
> optimized as they wouldn't have to lookup both gfn_stolen_bits and the vcpu->kvm
> mask of the shared bit.
> 
> Actually, the more I think about it, the more I like it.  For TDX, there's no
> risk of increased hash collisions, as we've already messed up if there's a
> shared vs. private collision.
> 
> And for XO, if it ever makes its way upstream, I think we should flat out disallow
> referencing XO addresses in non-leaf PTEs, i.e. make the XO permission bit reserved
> in non-leaf PTEs.  That would avoid any theoretical problems with the guest doing
> something stupid by polluting all its upper level PxEs with XO.  Collisions would
> be purely limited to the case where the guest is intentionally creating an alternate
> mapping, which should be a rare event (or the guest is compromised, which is also
> hopefully a rare event).
> 
> 

My main motivation is that 'gfn_stolen_bits' doesn't quite make sense in
'struct kvm_mmu_page' for the root, plus it seems a little bit redundant at
first glance.

So could we have your final suggestion? :)

Thanks,
-Kai
Sean Christopherson Aug. 6, 2021, 10:09 p.m. UTC | #11
On Sat, Aug 07, 2021, Kai Huang wrote:
> So could we have your final suggestion? :)

Try the kvm_mmu_page_role.private approach.  So long as it doesn't end up splattering
code everywhere, I think that's more aligned with how KVM generally wants to treat
the shared bit.

In the changelog for this patch, make it very clear that ensuring different aliases
get different shadow pages (if necessary) is the responsibility of each individual
feature that leverages stolen gfn bits.
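
For instance, with a hypothetical role.private bit, the existing lookup in
kvm_mmu_get_page() would already keep such aliases on separate shadow pages
without keying the hash on the stolen bits (simplified sketch):

	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
		/* Same gfn but a different role.word => a different sp. */
		if (sp->gfn != gfn || sp->role.word != role.word)
			continue;
		/* Found an existing shadow page for this gfn + role. */
		break;
	}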
Kai Huang Aug. 6, 2021, 10:24 p.m. UTC | #12
On Fri, 6 Aug 2021 22:09:35 +0000 Sean Christopherson wrote:
> On Sat, Aug 07, 2021, Kai Huang wrote:
> > So could we have your final suggestion? :)
> 
> Try the kvm_mmu_page_role.private approach.  So long as it doesn't end up splattering
> code everywhere, I think that's more aligned with how KVM generally wants to treat
> the shared bit.

OK.

> 
> In the changelog for this patch, make it very clear that ensuring different aliases
> get different shadow pages (if necessary) is the responsibility of each individual
> feature that leverages stolen gfn bits.

Make sense.  Thanks.
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 88d0ed5225a4..69b82857acdb 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -232,4 +232,30 @@  int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
 
+static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
+{
+	/* Currently there are no stolen bits in KVM */
+	return 0;
+}
+
+static inline gfn_t vcpu_gfn_stolen_mask(struct kvm_vcpu *vcpu)
+{
+	return kvm_gfn_stolen_mask(vcpu->kvm);
+}
+
+static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
+{
+	return kvm_gfn_stolen_mask(kvm) << PAGE_SHIFT;
+}
+
+static inline gpa_t vcpu_gpa_stolen_mask(struct kvm_vcpu *vcpu)
+{
+	return kvm_gpa_stolen_mask(vcpu->kvm);
+}
+
+static inline gfn_t vcpu_gpa_to_gfn_unalias(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	return (gpa >> PAGE_SHIFT) & ~vcpu_gfn_stolen_mask(vcpu);
+}
+
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0dc4bf34ce9c..990ee645b8a2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -188,27 +188,37 @@  static inline bool kvm_available_flush_tlb_with_range(void)
 	return kvm_x86_ops.tlb_remote_flush_with_range;
 }
 
-static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
-		struct kvm_tlb_range *range)
-{
-	int ret = -ENOTSUPP;
-
-	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
-		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
-
-	if (ret)
-		kvm_flush_remote_tlbs(kvm);
-}
-
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
 		u64 start_gfn, u64 pages)
 {
 	struct kvm_tlb_range range;
+	u64 gfn_stolen_mask;
+
+	if (!kvm_available_flush_tlb_with_range())
+		goto generic_flush;
+
+	/*
+	 * Fall back to the big hammer flush if there is more than one
+	 * GPA alias that needs to be flushed.
+	 */
+	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
+	if (hweight64(gfn_stolen_mask) > 1)
+		goto generic_flush;
 
 	range.start_gfn = start_gfn;
 	range.pages = pages;
+	if (static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range))
+		goto generic_flush;
+
+	if (!gfn_stolen_mask)
+		return;
 
-	kvm_flush_remote_tlbs_with_range(kvm, &range);
+	range.start_gfn |= gfn_stolen_mask;
+	static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, &range);
+	return;
+
+generic_flush:
+	kvm_flush_remote_tlbs(kvm);
 }
 
 bool is_nx_huge_page_enabled(void)
@@ -1949,14 +1959,16 @@  static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     int direct,
-					     unsigned int access)
+static struct kvm_mmu_page *__kvm_mmu_get_page(struct kvm_vcpu *vcpu,
+					       gfn_t gfn,
+					       gfn_t gfn_stolen_bits,
+					       gva_t gaddr,
+					       unsigned int level,
+					       int direct,
+					       unsigned int access)
 {
 	bool direct_mmu = vcpu->arch.mmu->direct_map;
+	gpa_t gfn_and_stolen = gfn | gfn_stolen_bits;
 	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
 	unsigned quadrant;
@@ -1978,9 +1990,9 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		role.quadrant = quadrant;
 	}
 
-	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn_and_stolen)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
-		if (sp->gfn != gfn) {
+		if ((sp->gfn | sp->gfn_stolen_bits) != gfn_and_stolen) {
 			collisions++;
 			continue;
 		}
@@ -2020,6 +2032,7 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	sp = kvm_mmu_alloc_page(vcpu, direct);
 
 	sp->gfn = gfn;
+	sp->gfn_stolen_bits = gfn_stolen_bits;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
@@ -2044,6 +2057,13 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     gva_t gaddr, unsigned int level,
+					     int direct, unsigned int access)
+{
+	return __kvm_mmu_get_page(vcpu, gfn, 0, gaddr, level, direct, access);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2637,7 +2657,9 @@  static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 
 	gfn = kvm_mmu_page_get_gfn(sp, start - sp->spt);
 	slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
-	if (!slot)
+
+	/* Don't map private memslots for stolen bits */
+	if (!slot || (sp->gfn_stolen_bits && slot->id >= KVM_USER_MEM_SLOTS))
 		return -1;
 
 	ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
@@ -2827,7 +2849,9 @@  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	struct kvm_shadow_walk_iterator it;
 	struct kvm_mmu_page *sp;
 	int level, req_level, ret;
-	gfn_t gfn = gpa >> PAGE_SHIFT;
+	gpa_t gpa_stolen_mask = vcpu_gpa_stolen_mask(vcpu);
+	gfn_t gfn = (gpa & ~gpa_stolen_mask) >> PAGE_SHIFT;
+	gfn_t gfn_stolen_bits = (gpa & gpa_stolen_mask) >> PAGE_SHIFT;
 	gfn_t base_gfn = gfn;
 
 	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
@@ -2852,8 +2876,9 @@  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 		drop_large_spte(vcpu, it.sptep);
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-					      it.level - 1, true, ACC_ALL);
+			sp = __kvm_mmu_get_page(vcpu, base_gfn,
+						gfn_stolen_bits, it.addr,
+						it.level - 1, true, ACC_ALL);
 
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (is_tdp && huge_page_disallowed &&
@@ -3689,6 +3714,13 @@  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
 		return true;
 
+	/* Don't expose aliases for no slot GFNs or private memslots */
+	if ((cr2_or_gpa & vcpu_gpa_stolen_mask(vcpu)) &&
+	    !kvm_is_visible_memslot(slot)) {
+		*pfn = KVM_PFN_NOSLOT;
+		return false;
+	}
+
 	/* Don't expose private memslots to L2. */
 	if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
 		*pfn = KVM_PFN_NOSLOT;
@@ -3723,7 +3755,7 @@  static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	bool write = error_code & PFERR_WRITE_MASK;
 	bool map_writable;
 
-	gfn_t gfn = gpa >> PAGE_SHIFT;
+	gfn_t gfn = vcpu_gpa_to_gfn_unalias(vcpu, gpa);
 	unsigned long mmu_seq;
 	kvm_pfn_t pfn;
 	hva_t hva;
@@ -3833,7 +3865,7 @@  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	     max_level > PG_LEVEL_4K;
 	     max_level--) {
 		int page_num = KVM_PAGES_PER_HPAGE(max_level);
-		gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
+		gfn_t base = vcpu_gpa_to_gfn_unalias(vcpu, gpa) & ~(page_num - 1);
 
 		if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
 			break;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d64ccb417c60..c896ec9f3159 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -46,6 +46,7 @@  struct kvm_mmu_page {
 	 */
 	union kvm_mmu_page_role role;
 	gfn_t gfn;
+	gfn_t gfn_stolen_bits;
 
 	u64 *spt;
 	/* hold the gfn of each spte inside spt */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 823a5919f9fa..439dc141391b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -25,7 +25,8 @@ 
 	#define guest_walker guest_walker64
 	#define FNAME(name) paging##64_##name
 	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~vcpu_gpa_stolen_mask(vcpu) & \
+					     PT64_LVL_ADDR_MASK(lvl))
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -44,7 +45,7 @@ 
 	#define guest_walker guest_walker32
 	#define FNAME(name) paging##32_##name
 	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
 	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
@@ -58,7 +59,7 @@ 
 	#define guest_walker guest_walkerEPT
 	#define FNAME(name) ept_##name
 	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -75,7 +76,7 @@ 
 #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
 
 #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
-#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
+#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
 
 /*
  * The guest_walker structure emulates the behavior of the hardware page
@@ -96,9 +97,9 @@  struct guest_walker {
 	struct x86_exception fault;
 };
 
-static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
+static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
 {
-	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
+	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
 }
 
 static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
@@ -366,7 +367,7 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		--walker->level;
 
 		index = PT_INDEX(addr, walker->level);
-		table_gfn = gpte_to_gfn(pte);
+		table_gfn = gpte_to_gfn(vcpu, pte);
 		offset    = index * sizeof(pt_element_t);
 		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
 
@@ -432,7 +433,7 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	if (unlikely(errcode))
 		goto error;
 
-	gfn = gpte_to_gfn_lvl(pte, walker->level);
+	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
 	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
 
 	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
@@ -537,12 +538,14 @@  FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 
+	WARN_ON(gpte & vcpu_gpa_stolen_mask(vcpu));
+
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
-	gfn = gpte_to_gfn(gpte);
+	gfn = gpte_to_gfn(vcpu, gpte);
 	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
 	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
 	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
@@ -652,6 +655,8 @@  static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 
 	direct_access = gw->pte_access;
 
+	WARN_ON(addr & vcpu_gpa_stolen_mask(vcpu));
+
 	top_level = vcpu->arch.mmu->root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
 		top_level = PT32_ROOT_LEVEL;
@@ -1067,7 +1072,7 @@  static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 			continue;
 		}
 
-		gfn = gpte_to_gfn(gpte);
+		gfn = gpte_to_gfn(vcpu, gpte);
 		pte_access = sp->role.access;
 		pte_access &= FNAME(gpte_access)(gpte);
 		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);