
[2/2] KVM: X86/MMU: Introduce role.passthrough for shadowing 5-level NPT for 4-level NPT L1 guest

Message ID 20220420131204.2850-3-jiangshanlai@gmail.com (mailing list archive)
State New, archived
Headers show
Series KVM: X86/MMU: Fix problem for shadowing 5-level NPT for 4-level NPT L1 guest

Commit Message

Lai Jiangshan April 20, 2022, 1:12 p.m. UTC
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

When shadowing 5-level NPT for a 4-level NPT L1 guest, the root_sp is
allocated with role.level = 5 and the guest pagetable's root gfn.

root_sp->spt[0] is also allocated with the same gfn and the same role
except that role.level = 4.  Luckily, they are different shadow pages,
but only root_sp->spt[0] is the real translation of the guest
pagetable.

Here comes a problem:

If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
and uses the same gfn as the root page for nested NPT before and after
the switch, the host (hCR4_LA57=1) might keep using the same root_sp
for the guest even though gCR4_LA57 has changed.  The guest will then
see unexpected pages mapped, and L2 may exploit the bug to hurt L1.
Luckily the problem can't hurt L0.
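
For illustration, a minimal user-space sketch of the collision (toy
structures, not KVM's code): the hash lookup key is (gfn, role), and
without a passthrough bit the role word is identical before and after
the LA57 toggle because role.level is the host level in both cases.

/*
 * Toy user-space sketch, not KVM code: shadow pages are looked up by
 * (gfn, role).  Without a passthrough bit the role word is identical
 * before and after the guest toggles gCR4_LA57, because role.level is
 * the host paging level (5) in both cases, so the stale root_sp would
 * be found in the hash and reused.
 */
#include <stdint.h>
#include <stdio.h>

union toy_page_role {                   /* stand-in for kvm_mmu_page_role */
        uint32_t word;
        struct {
                unsigned level:4;
                unsigned direct:1;
                unsigned passthrough:1; /* the bit added by this patch */
        };
};

struct toy_shadow_page {                /* stand-in for kvm_mmu_page */
        uint64_t gfn;
        union toy_page_role role;
};

int main(void)
{
        /* root_sp cached while the guest root was a 5-level table (gCR4_LA57=1) */
        struct toy_shadow_page cached = {
                .gfn = 0x1234,
                .role = { .level = 5, .passthrough = 0 },
        };

        /* same root gfn after the guest cleared gCR4_LA57: the host still
         * shadows at level 5, so only the new passthrough bit differs */
        union toy_page_role wanted = { .level = 5, .passthrough = 1 };

        if (cached.role.word == wanted.word)
                printf("stale root_sp reused (the bug)\n");
        else
                printf("lookup misses, a fresh root_sp is built (the fix)\n");
        return 0;
}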

Three special cases then need to be handled:

The root_sp should behave in some ways as if role.direct=1: its
contents are not backed by gptes and root_sp->gfns is meaningless.
(For a normal high-level sp in shadow paging, sp->gfns is often unused
and kept zero, but it is meaningful when it is used, because the
entries are backed by concrete gptes.)

Such a root_sp is just a portal that contributes root_sp->spt[0];
root_sp->gfns should not be used, and root_sp->spt[0] should not be
dropped if gpte[0] of the guest root pagetable is changed.

Such a root_sp should not be accounted either.

So add role.passthrough to distinguish these shadow pages in the hash
when gCR4_LA57 is toggled, and fix the above special cases by checking
it in kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 Documentation/virt/kvm/mmu.rst  |  3 +++
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++++
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 4 files changed, 23 insertions(+), 2 deletions(-)

Comments

Paolo Bonzini April 20, 2022, 5:15 p.m. UTC | #1
On 4/20/22 15:12, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> When shadowing 5-level NPT for a 4-level NPT L1 guest, the root_sp is
> allocated with role.level = 5 and the guest pagetable's root gfn.
> 
> root_sp->spt[0] is also allocated with the same gfn and the same role
> except that role.level = 4.  Luckily, they are different shadow pages,
> but only root_sp->spt[0] is the real translation of the guest
> pagetable.
> 
> Here comes a problem:
> 
> If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
> and uses the same gfn as the root page for nested NPT before and after
> the switch, the host (hCR4_LA57=1) might keep using the same root_sp
> for the guest even though gCR4_LA57 has changed.  The guest will then
> see unexpected pages mapped, and L2 may exploit the bug to hurt L1.
> Luckily the problem can't hurt L0.
> 
> Three special cases then need to be handled:
> 
> The root_sp should behave in some ways as if role.direct=1: its
> contents are not backed by gptes and root_sp->gfns is meaningless.
> (For a normal high-level sp in shadow paging, sp->gfns is often unused
> and kept zero, but it is meaningful when it is used, because the
> entries are backed by concrete gptes.)
> 
> Such a root_sp is just a portal that contributes root_sp->spt[0];
> root_sp->gfns should not be used, and root_sp->spt[0] should not be
> dropped if gpte[0] of the guest root pagetable is changed.
> 
> Such a root_sp should not be accounted either.
> 
> So add role.passthrough to distinguish these shadow pages in the hash
> when gCR4_LA57 is toggled, and fix the above special cases by checking
> it in kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Looks good, and is easy to backport; thanks.

However, once we also have your series "KVM: X86/MMU: Use one-off
special shadow page for special roots", perhaps the passthrough sp can
be converted to a special root (but now with a shadow_page) as well?

Paolo

> ---
>   Documentation/virt/kvm/mmu.rst  |  3 +++
>   arch/x86/include/asm/kvm_host.h |  5 +++--
>   arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++++
>   arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
>   4 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
> index 5b1ebad24c77..4018b9d7a0d3 100644
> --- a/Documentation/virt/kvm/mmu.rst
> +++ b/Documentation/virt/kvm/mmu.rst
> @@ -202,6 +202,9 @@ Shadow pages contain the following information:
>       Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
>       bits before Haswell; shadow EPT page tables also cannot use A/D bits
>       if the L1 hypervisor does not enable them.
> +  role.passthrough:
> +    Is 1 if role.level = 5 when shadowing 5-level shadow NPT for
> +    4-level NPT L1.
>     gfn:
>       Either the guest page table containing the translations shadowed by this
>       page, or the base page frame for linear translations.  See role.direct.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9694dd5e6ccc..d4f8f4784d87 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -285,7 +285,7 @@ struct kvm_kernel_irq_routing_entry;
>    * minimize the size of kvm_memory_slot.arch.gfn_track, i.e. allows allocating
>    * 2 bytes per gfn instead of 4 bytes per gfn.
>    *
> - * Indirect upper-level shadow pages are tracked for write-protection via
> + * Upper-level shadow pages having gptes are tracked for write-protection via
>    * gfn_track.  As above, gfn_track is a 16 bit counter, so KVM must not create
>    * more than 2^16-1 upper-level shadow pages at a single gfn, otherwise
>    * gfn_track will overflow and explosions will ensure.
> @@ -331,7 +331,8 @@ union kvm_mmu_page_role {
>   		unsigned smap_andnot_wp:1;
>   		unsigned ad_disabled:1;
>   		unsigned guest_mode:1;
> -		unsigned :6;
> +		unsigned passthrough:1;
> +		unsigned :5;
>   
>   		/*
>   		 * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1bdff55218ef..d14cb6f99cb1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -737,6 +737,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>   
>   static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>   {
> +	if (sp->role.passthrough)
> +		return sp->gfn;
> +
>   	if (!sp->role.direct)
>   		return sp->gfns[index];
>   
> @@ -745,6 +748,11 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>   
>   static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
>   {
> +	if (sp->role.passthrough) {
> +		WARN_ON_ONCE(gfn != sp->gfn);
> +		return;
> +	}
> +
>   	if (!sp->role.direct) {
>   		sp->gfns[index] = gfn;
>   		return;
> @@ -1861,6 +1869,9 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
>   	if (sp->role.direct)
>   		return false;
>   
> +	if (sp->role.passthrough)
> +		return false;
> +
>   	return true;
>   }
>   
> @@ -2059,6 +2070,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
>   		role.quadrant = quadrant;
>   	}
> +	if (level <= vcpu->arch.mmu->root_level)
> +		role.passthrough = 0;
>   
>   	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>   	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -4890,6 +4903,9 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>   
>   	role.base.direct = false;
>   	role.base.level = kvm_mmu_get_tdp_level(vcpu);
> +	if (role.base.level == PT64_ROOT_5LEVEL &&
> +	    role_regs_to_root_level(regs) == PT64_ROOT_4LEVEL)
> +		role.base.passthrough = 1;
>   
>   	return role;
>   }
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 8621188b46df..c1b975fb85a2 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1042,6 +1042,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>   		.level = 0xf,
>   		.access = 0x7,
>   		.quadrant = 0x3,
> +		.passthrough = 0x1,
>   	};
>   
>   	/*
Lai Jiangshan April 21, 2022, 12:23 a.m. UTC | #2
On Thu, Apr 21, 2022 at 1:15 AM Paolo Bonzini <pbonzini@redhat.com> wrote:

>
> Looks good, and is easy to backport; thanks.
>
> However, once we have also your series "KVM: X86/MMU: Use one-off
> special shadow page for special roots", perhaps the passthrough sp can
> be converted to a special root (but now with a shadow_page) as well?
>

It will be much more complicated.

The current code allows fast_pgd_switch() when shadowing 5-level NPT
for 4-level NPT L1 guests because both the host and the guest are
64-bit.  Using a special shadow page would disallow fast_pgd_switch().

It would also make using_special_root_page() more complicated.

It would require a different kind of special shadow page to make a
fast switch possible.
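
As a rough, simplified illustration of the trade-off (toy structures,
not KVM's actual prev_roots code): a cached root can only be
fast-switched to when it is backed by a shadow page whose role matches
the newly computed role; a passthrough sp still provides such a role,
while a one-off special root would not.

/*
 * Toy sketch, not KVM's actual prev_roots code: a cached root can be
 * fast-switched to only if it is backed by a shadow page whose role
 * matches the newly computed role.  A passthrough root_sp is still a
 * shadow page with a role; a one-off special root would not be.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_cached_root {
        uint64_t pgd;
        bool backed_by_sp;      /* false for a hypothetical one-off special root */
        uint32_t role_word;     /* only meaningful when backed_by_sp */
};

static bool toy_fast_switch_ok(const struct toy_cached_root *root,
                               uint64_t new_pgd, uint32_t new_role_word)
{
        return root->backed_by_sp && root->pgd == new_pgd &&
               root->role_word == new_role_word;
}

int main(void)
{
        uint32_t role = 0x25;   /* arbitrary role word for the example */
        struct toy_cached_root passthrough = {
                .pgd = 0x1000, .backed_by_sp = true, .role_word = 0x25,
        };
        struct toy_cached_root special = {
                .pgd = 0x1000, .backed_by_sp = false,
        };

        printf("passthrough sp root: fast switch %s\n",
               toy_fast_switch_ok(&passthrough, 0x1000, role) ? "possible" : "not possible");
        printf("one-off special root: fast switch %s\n",
               toy_fast_switch_ok(&special, 0x1000, role) ? "possible" : "not possible");
        return 0;
}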

Patch

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 5b1ebad24c77..4018b9d7a0d3 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -202,6 +202,9 @@  Shadow pages contain the following information:
     Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
     bits before Haswell; shadow EPT page tables also cannot use A/D bits
     if the L1 hypervisor does not enable them.
+  role.passthrough:
+    Is 1 if role.level = 5 when shadowing 5-level shadow NPT for
+    4-level NPT L1.
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations.  See role.direct.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9694dd5e6ccc..d4f8f4784d87 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -285,7 +285,7 @@  struct kvm_kernel_irq_routing_entry;
  * minimize the size of kvm_memory_slot.arch.gfn_track, i.e. allows allocating
  * 2 bytes per gfn instead of 4 bytes per gfn.
  *
- * Indirect upper-level shadow pages are tracked for write-protection via
+ * Upper-level shadow pages having gptes are tracked for write-protection via
  * gfn_track.  As above, gfn_track is a 16 bit counter, so KVM must not create
  * more than 2^16-1 upper-level shadow pages at a single gfn, otherwise
  * gfn_track will overflow and explosions will ensure.
@@ -331,7 +331,8 @@  union kvm_mmu_page_role {
 		unsigned smap_andnot_wp:1;
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
-		unsigned :6;
+		unsigned passthrough:1;
+		unsigned :5;
 
 		/*
 		 * This is left at the top of the word so that
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1bdff55218ef..d14cb6f99cb1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -737,6 +737,9 @@  static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
+	if (sp->role.passthrough)
+		return sp->gfn;
+
 	if (!sp->role.direct)
 		return sp->gfns[index];
 
@@ -745,6 +748,11 @@  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
+	if (sp->role.passthrough) {
+		WARN_ON_ONCE(gfn != sp->gfn);
+		return;
+	}
+
 	if (!sp->role.direct) {
 		sp->gfns[index] = gfn;
 		return;
@@ -1861,6 +1869,9 @@  static bool sp_has_gptes(struct kvm_mmu_page *sp)
 	if (sp->role.direct)
 		return false;
 
+	if (sp->role.passthrough)
+		return false;
+
 	return true;
 }
 
@@ -2059,6 +2070,8 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	if (level <= vcpu->arch.mmu->root_level)
+		role.passthrough = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -4890,6 +4903,9 @@  kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
 
 	role.base.direct = false;
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
+	if (role.base.level == PT64_ROOT_5LEVEL &&
+	    role_regs_to_root_level(regs) == PT64_ROOT_4LEVEL)
+		role.base.passthrough = 1;
 
 	return role;
 }
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 8621188b46df..c1b975fb85a2 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1042,6 +1042,7 @@  static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.level = 0xf,
 		.access = 0x7,
 		.quadrant = 0x3,
+		.passthrough = 0x1,
 	};
 
 	/*