
[RFC,4/6] KVM: X86: Introduce role.level_promoted

Message ID 20211210092508.7185-5-jiangshanlai@gmail.com (mailing list archive)
State New, archived
Series KVM: X86: Add and use shadow page with level promoted or acting as pae_root

Commit Message

Lai Jiangshan Dec. 10, 2021, 9:25 a.m. UTC
From: Lai Jiangshan <laijs@linux.alibaba.com>

Level promotion occurs when mmu->shadow_root_level > mmu->root_level.

There are several cases that can cause level promotion:

shadow mmu (shadow paging for 32 bit guest):
	case1:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0

shadow nested NPT (for 32bit L1 hypervisor):
	case2:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
	case3:	gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1

shadow nested NPT (for 64bit L1 hypervisor):
	case4:	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1

When level promotion occurs (32-bit guest, case1-3), special roots are
often used.  But case4 does not use special roots; it uses a shadow
page without being fully aware of the specialty.  It might work
accidentally:
	1) The root_page (root_sp->spt) is allocated with level = 5,
	   and root_sp->spt[0] is allocated with the same gfn and the
	   same role except role.level = 4.  Luckily, they are
	   different shadow pages.
	2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and
	   walker->pt_access[4], which are normally unused when
	   mmu->shadow_root_level == mmu->root_level == 4, so that
	   FNAME(fetch) can use them to allocate a shadow page for
	   root_sp->spt[0] and link them when shadow_root_level == 5.

But it has problems.
If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
and uses the same gfn as the root for the nNPT before and after
switching gCR4_LA57, the host (hCR4_LA57=1) would use the same root_sp
for the guest even though the guest switched gCR4_LA57.  The guest
would see unexpected pages mapped and L2 can hurt L1.  It is lucky
that the problem can't hurt L0.

The root_sp should sometimes behave like role.direct=1: its contents
are not backed by gptes and root_sp->gfns is meaningless.  For a normal
high-level sp, sp->gfns is often unused and kept zero, but it can be
relevant and meaningful when it is used, because its entries are backed
by concrete gptes.  For the level-promoted root_sp described above,
root_sp is just a portal to contribute root_sp->spt[0]; root_sp should
not have root_sp->gfns, and root_sp->spt[0] should not be dropped if
gpte[0] of the root gfn is changed.

This patch adds role.level_promoted to address the two problems.
role.level_promoted is set when shadow paging is in use and
role.level > gMMU.level.
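
To see why the bit disambiguates case4, here is a simplified,
self-contained model (plain userspace C; the field widths and helper
names are invented for illustration and are not kernel code):

#include <stdio.h>
#include <stdbool.h>

/*
 * With hCR4_LA57=1 the shadow root is always built at level 5, so the
 * gCR4_LA57=0 and gCR4_LA57=1 guest configurations would otherwise
 * compute an identical role for the same root gfn.
 */
union role {
	struct {
		unsigned level:4;		/* shadow/host root level */
		unsigned level_promoted:1;	/* the bit added by this patch */
	};
	unsigned word;
};

static union role calc_root_role(int guest_root_level, int host_level,
				 bool with_patch)
{
	union role r = { .word = 0 };

	r.level = host_level;
	if (with_patch && host_level > guest_root_level)
		r.level_promoted = 1;
	return r;
}

int main(void)
{
	/* hCR4_LA57=1: host level 5; guest root level is 4 or 5 */
	union role off, on;

	off = calc_root_role(4, 5, false);	/* gCR4_LA57=0 */
	on  = calc_root_role(5, 5, false);	/* gCR4_LA57=1 */
	printf("without level_promoted: roles %s\n",
	       off.word == on.word ? "collide -> same root_sp reused" : "differ");

	off = calc_root_role(4, 5, true);
	on  = calc_root_role(5, 5, true);
	printf("with level_promoted:    roles %s\n",
	       off.word == on.word ? "collide" : "differ -> separate root_sp");
	return 0;
}

Without the bit, the two configurations produce the same role word, so
kvm_mmu_get_page() finds and reuses the same root_sp across the
gCR4_LA57 switch.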

An alternative way to fix the problem of case4 is to also use the
special root pml5_root for it.  But that would require changing many
other places because of the assumption that special roots are only
used for 32-bit guests.

This patch also paves the way to use level-promoted shadow pages for
case1-3, but that requires special handling for PAE paging, so the
extensive usage is not included here.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu/mmu.c          | 15 +++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 3 files changed, 16 insertions(+), 3 deletions(-)

Comments

Sean Christopherson Jan. 4, 2022, 10:14 p.m. UTC | #1
On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Level pormotion occurs when mmu->shadow_root_level > mmu->root_level.

s/pormotion/promotion (in multiple places)

That said, I strongly prefer "passthrough" or maybe "nop" over "level_promoted".
Promoted in the context of page level usually refers to making a hugepage, which
is not the case here.

> 
> There are several cases that can cuase level pormotion:
> 
> shadow mmu (shadow paging for 32 bit guest):
> 	case1:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0
> 
> shadow nested NPT (for 32bit L1 hypervisor):
> 	case2:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
> 	case3:	gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1
> 
> shadow nested NPT (for 64bit L1 hypervisor):
> 	case4:	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1
> 
> When level pormotion occurs (32bit guest, case1-3), special roots are
> often used.  But case4 is not using special roots.  It uses shadow page
> without fully aware of the specialty.  It might work accidentally:
> 	1) The root_page (root_sp->spt) is allocated with level = 5,
> 	   and root_sp->spt[0] is allocated with the same gfn and the
> 	   same role except role.level = 4.  Luckly that they are
> 	   different shadow pages.
> 	2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and
> 	   walker->pt_access[4], which are normally unused when
> 	   mmu->shadow_root_level == mmu->root_level == 4, so that
> 	   FNAME(fetch) can use them to allocate shadow page for
> 	   root_sp->spt[0] and link them when shadow_root_level == 5.
> 
> But it has problems.
> If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
> and usees the same gfn as the root for the nNPT before and after
> switching gCR4_LA57.  The host (hCR4_LA57=1) wold use the same root_sp
> for guest even guest switches gCR4_LA57.  The guest will see unexpected
> page mapped and L2 can hurts L1.  It is lucky the the problem can't
> hurt L0.
> 
> The root_sp should be like role.direct=1 sometimes: its contents are
> not backed by gptes, root_sp->gfns is meaningless.  For a normal high
> level sp, sp->gfns is often unused and kept zero, but it could be
> relevant and meaningful when sp->gfns is used because they are back
> by concret gptes.  For level-promoted root_sp described before, root_sp
> is just a portal to contribute root_sp->spt[0], and root_sp should not
> have root_sp->gfns and root_sp->spt[0] should not be dropped if gpte[0]
> of the root gfn is changed.
> 
> This patch adds role.level_promoted to address the two problems.
> role.level_promoted is set when shadow paging and
> role.level > gMMU.level.
> 
> An alternative way to fix the problem of case4 is that: also using the
> special root pml5_root for it.  But it would required to change many
> other places because it is assumption that special roots is only used
> for 32bit guests.
> 
> This patch also paves the way to use level promoted shadow page for
> case1-3, but that requires the special handling or PAE paging, so the
> extensive usage of it is not included.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ++-
>  arch/x86/kvm/mmu/mmu.c          | 15 +++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88ecf53f0d2b..6465c83794fc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -334,7 +334,8 @@ union kvm_mmu_page_role {
>  		unsigned smap_andnot_wp:1;
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
> -		unsigned :6;
> +		unsigned level_promoted:1;
> +		unsigned :5;

The massive comment above this union needs to be updated.

>  		/*
>  		 * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 54e7cbc15380..4769253e9024 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -767,6 +767,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>  
>  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  {
> +	if (sp->role.level_promoted)
> +		return sp->gfn;
> +
>  	if (!sp->role.direct)
>  		return sp->gfns[index];
>  
> @@ -776,6 +779,8 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
>  {
>  	if (!sp->role.direct) {
> +		if (WARN_ON_ONCE(sp->role.level_promoted && gfn != sp->gfn))
> +			return;
>  		sp->gfns[index] = gfn;

This is wrong, sp->gfns is NULL when sp->role.level_promoted is true.  I believe
you want something similar to the "get" flow, e.g.

	if (sp->role.passthrough) {
		WARN_ON_ONCE(gfn != sp->gfn);
		return;
	}

	if (!sp->role.direct) {
		...
	}
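
Folded into the existing helper, the end result might look roughly like
this (a sketch only, using the suggested "passthrough" name and eliding
the handling of direct shadow pages):

	static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
	{
		/* sp->gfns is NULL for passthrough pages; only sanity-check the gfn */
		if (sp->role.passthrough) {
			WARN_ON_ONCE(gfn != sp->gfn);
			return;
		}

		if (!sp->role.direct) {
			sp->gfns[index] = gfn;
			return;
		}

		/* existing handling for direct shadow pages ... */
	}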

Alternatively, should we mark passthrough shadow pages as direct=1?  That would
naturally handle this code, and for things like reexecute_instruction()'s usage
of kvm_mmu_unprotect_page(), I don't think passthrough shadow pages should be
considered indirect, e.g. zapping them won't help and the shadow page can't become
unsync.

>  		return;
>  	}
> @@ -1702,7 +1707,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  	hlist_del(&sp->hash_link);
>  	list_del(&sp->link);
>  	free_page((unsigned long)sp->spt);
> -	if (!sp->role.direct)
> +	if (!sp->role.direct && !sp->role.level_promoted)

Hmm, at this point, it may be better to just omit the check entirely.  @sp is
zero allocated and free_page() plays nice with NULL "pointers".  I believe TDX
and maybe SNP support will need to do more work here, i.e. may add back a similar
check, but if so then checking sp->gfns directly for !NULL is preferable.
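
I.e. the tail of kvm_mmu_free_page() could simply become (sketch):

	free_page((unsigned long)sp->spt);
	/* sp->gfns is NULL for direct and passthrough pages; free_page(0) is a no-op */
	free_page((unsigned long)sp->gfns);
	kmem_cache_free(mmu_page_header_cache, sp);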

>  		free_page((unsigned long)sp->gfns);
>  	kmem_cache_free(mmu_page_header_cache, sp);
>  }
> @@ -1740,7 +1745,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  
>  	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>  	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> -	if (!role.direct)
> +	if (!(role.direct || role.level_promoted))

I personally prefer the style of the previous check, mainly because I'm horrible
at reading !(x || y).

	if (!role.direct && !role.passthrough)

>  		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
> @@ -2084,6 +2089,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
>  		role.quadrant = quadrant;
>  	}
> +	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
> +		role.level_promoted = 0;
>  
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -4836,6 +4843,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>  
>  	role.base.direct = false;
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
> +	if (role.base.level > role_regs_to_root_level(regs))
> +		role.base.level_promoted = 1;
>  
>  	return role;
>  }
> @@ -5228,6 +5237,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  	kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
>  
>  	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> +		if (sp->role.level_promoted)
> +			continue;
>  		if (detect_write_misaligned(sp, gpa, bytes) ||
>  		      detect_write_flooding(sp)) {
>  			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5c78300fc7d9..16ac276d342a 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		.level = 0xf,
>  		.access = 0x7,
>  		.quadrant = 0x3,
> +		.level_promoted = 0x1,
>  	};
>  
>  	/*
> -- 
> 2.19.1.6.gb485710b
>
Paolo Bonzini Feb. 11, 2022, 4:06 p.m. UTC | #2
On 1/4/22 23:14, Sean Christopherson wrote:
> Alternatively, should we mark passthrough shadow pages as direct=1?  That would
> naturally handle this code, and for things like reexecute_instruction()'s usage
> of kvm_mmu_unprotect_page(), I don't think passthrough shadow pages should be
> considered indirect, e.g. zapping them won't help and the shadow page can't become
> unsync.

So the main difference between direct and passthrough shadow pages is that
passthrough pages can have indirect children.  A direct page maps the
page at sp->gfn, while a passthrough page maps the page _table_ at
sp->gfn.

Is this correct?

If so, I think there is a difference between a passthrough page that
maps a level-2 page from level-4, and a passthrough page that maps a
level-3 page from level-4.  This means that a single bit in the role
is not enough.

One way to handle this could be to have a single field "direct_levels"
that subsumes both "direct" and "passthrough".  direct && !passthrough
would correspond to "direct_levels == level", while !direct && !passthrough
would correspond to "direct_levels == 0".
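
For concreteness, the mapping could be expressed as below (the field
and helper names are hypothetical, just to illustrate the encoding):

	static inline bool sp_fully_direct(union kvm_mmu_page_role role)
	{
		/* old role.direct && !passthrough */
		return role.direct_levels == role.level;
	}

	static inline bool sp_fully_indirect(union kvm_mmu_page_role role)
	{
		/* old !role.direct && !passthrough */
		return role.direct_levels == 0;
	}

	/*
	 * Intermediate values would distinguish passthrough pages that
	 * skip different numbers of guest levels.
	 */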

Paolo

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88ecf53f0d2b..6465c83794fc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -334,7 +334,8 @@  union kvm_mmu_page_role {
 		unsigned smap_andnot_wp:1;
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
-		unsigned :6;
+		unsigned level_promoted:1;
+		unsigned :5;
 
 		/*
 		 * This is left at the top of the word so that
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 54e7cbc15380..4769253e9024 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -767,6 +767,9 @@  static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
+	if (sp->role.level_promoted)
+		return sp->gfn;
+
 	if (!sp->role.direct)
 		return sp->gfns[index];
 
@@ -776,6 +779,8 @@  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
 	if (!sp->role.direct) {
+		if (WARN_ON_ONCE(sp->role.level_promoted && gfn != sp->gfn))
+			return;
 		sp->gfns[index] = gfn;
 		return;
 	}
@@ -1702,7 +1707,7 @@  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
-	if (!sp->role.direct)
+	if (!sp->role.direct && !sp->role.level_promoted)
 		free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
@@ -1740,7 +1745,7 @@  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!role.direct)
+	if (!(role.direct || role.level_promoted))
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2084,6 +2089,8 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
+		role.level_promoted = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -4836,6 +4843,8 @@  kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
 
 	role.base.direct = false;
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
+	if (role.base.level > role_regs_to_root_level(regs))
+		role.base.level_promoted = 1;
 
 	return role;
 }
@@ -5228,6 +5237,8 @@  static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
+		if (sp->role.level_promoted)
+			continue;
 		if (detect_write_misaligned(sp, gpa, bytes) ||
 		      detect_write_flooding(sp)) {
 			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5c78300fc7d9..16ac276d342a 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1043,6 +1043,7 @@  static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.level = 0xf,
 		.access = 0x7,
 		.quadrant = 0x3,
+		.level_promoted = 0x1,
 	};
 
 	/*