
[02/23] KVM: x86/mmu: Derive shadow MMU page role from parent

Message ID: 20220203010051.2813563-3-dmatlack@google.com (mailing list archive)
State: New, archived
Series: Extend Eager Page Splitting to the shadow MMU

Commit Message

David Matlack Feb. 3, 2022, 1 a.m. UTC
Instead of computing the shadow page role from scratch for every new
page, we can derive most of the information from the parent shadow page.
This avoids redundant calculations such as the quadrant, and reduces the
number of parameters to kvm_mmu_get_page().
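Concretely, the new kvm_mmu_child_role() helper in the diff below copies the
parent's role and overrides only the per-child bits, roughly:

	role = parent_sp->role;
	role.level--;
	role.access = access;
	role.direct = direct;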

Preemptively split out the role calculation to a separate function for
use in a following commit.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 71 ++++++++++++++++++++++------------
 arch/x86/kvm/mmu/paging_tmpl.h |  9 +++--
 2 files changed, 51 insertions(+), 29 deletions(-)

Comments

Sean Christopherson Feb. 19, 2022, 1:14 a.m. UTC | #1
On Thu, Feb 03, 2022, David Matlack wrote:
> Instead of computing the shadow page role from scratch for every new
> page, we can derive most of the information from the parent shadow page.
> This avoids redundant calculations such as the quadrant, and reduces the

Uh, calculating quadrant isn't redundant.  The quadrant forces KVM to use different
(multiple) shadow pages to shadow a single guest PTE when the guest is using 32-bit
paging (1024 PTEs per page table vs. 512 PTEs per page table).  The reason quadrant
is "quad" and not more or less is because 32-bit paging has two levels.  First-level
PTEs can have quadrant=0/1, and that gets doubled for second-level PTEs because we
need to use four PTEs (two to handle 2x guest PTEs, and each of those needs to be
unique for the first-level PTEs they point at).
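As a rough illustration (hypothetical helper, not from the patch), the existing
calculation in kvm_mmu_get_page() amounts to the following, with PAGE_SHIFT = 12,
PT64_PT_BITS = 9 and PT32_PT_BITS = 10:

	/*
	 * The quadrant selects which slice of the guest's wider 32-bit page
	 * table this shadow page covers:
	 *
	 *   level 2 (page directory): (gaddr >> 30) & 3
	 *	one of 4 shadow PDs, each covering 1GiB of the guest's 4GiB
	 *	linear address space.
	 *   level 1 (page table):     (gaddr >> 21) & 1
	 *	one of 2 shadow PTs, each covering 2MiB of the 4MiB mapped by
	 *	a 1024-entry guest page table.
	 */
	static unsigned int quadrant_for(gva_t gaddr, unsigned int level)
	{
		unsigned int quadrant;

		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;

		return quadrant;
	}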

Indeed, this fails spectacularly when attempting to boot a 32-bit non-PAE kernel
with shadow paging enabled.

 BUG: unable to handle page fault for address: ff9fa81c
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 *pde = 00000000
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 0 Comm: swapper Tainted: G        W         5.12.0 #10
 EIP: memblock_add_range.isra.18.constprop.23
 Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
 EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
 ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
 CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600
 Call Trace:
  ? printk
  memblock_reserve
  ? 0xc2000000
  setup_arch
  ? vprintk_default
  ? vprintk
  start_kernel
  i386_start_kernel
  startup_32_smp

 CR2: 00000000ff9fa81c

 EIP: memblock_add_range.isra.18.constprop.23
 Code: <83> 79 04 00 75 2c 83 38 01 75 06 83 78 08 00 74 02 0f 0b 89 11 8b
 EAX: c2af24bc EBX: fdffffff ECX: ff9fa818 EDX: 02000000
 ESI: 02000000 EDI: 00000000 EBP: c2909f30 ESP: c2909f0c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210006
 CR0: 80050033 CR2: ff9fa81c CR3: 02b76000 CR4: 00040600

> number of parameters to kvm_mmu_get_page().
> 
> Preemptivel split out the role calculation to a separate function for

Preemptively.

> use in a following commit.
> 
> No functional change intended.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
David Matlack Feb. 24, 2022, 6:45 p.m. UTC | #2
On Fri, Feb 18, 2022 at 5:14 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Feb 03, 2022, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, we can derive most of the information from the parent shadow page.
> > This avoids redundant calculations such as the quadrant, and reduces the
>
> Uh, calculating quadrant isn't redundant.  The quadrant forces KVM to use different
> (multiple) shadow pages to shadow a single guest PTE when the guest is using 32-bit
> paging (1024 PTEs per page table vs. 512 PTEs per page table).  The reason quadrant
> is "quad" and not more or less is because 32-bit paging has two levels.  First-level
> PTEs can have quadrant=0/1, and that gets doubled for second-level PTEs because we
> need to use four PTEs (two to handle 2x guest PTEs, and each of those needs to be
> unique for the first-level PTEs they point at).
>
> Indeed, this fails spectacularly when attempting to boot a 32-bit non-PAE kernel
> with shadow paging enabled.

*facepalm*

Thanks for catching this. I'll fix this up in v2 and add 32-bit
non-PAE guests with shadow paging to my test matrix.

David Matlack March 4, 2022, 12:22 a.m. UTC | #3
On Sat, Feb 19, 2022 at 01:14:16AM +0000, Sean Christopherson wrote:
> On Thu, Feb 03, 2022, David Matlack wrote:
> > Instead of computing the shadow page role from scratch for every new
> > page, we can derive most of the information from the parent shadow page.
> > This avoids redundant calculations such as the quadrant, and reduces the
> 
> Uh, calculating quadrant isn't redundant.  The quadrant forces KVM to use different
> (multiple) shadow pages to shadow a single guest PTE when the guest is using 32-bit
> paging (1024 PTEs per page table vs. 512 PTEs per page table).  The reason quadrant
> is "quad" and not more or less is because 32-bit paging has two levels.  First-level
> PTEs can have quadrant=0/1, and that gets doubled for second-level PTEs because we
> need to use four PTEs (two to handle 2x guest PTEs, and each of those needs to be
> unique for the first-level PTEs they point at).

One solution is to keep the quadrant calculation in kvm_mmu_get_page().
The obvious problem for eager page splitting is that the existing
calculation needs the faulting address to derive the quadrant, and there
is no faulting address when doing eager page splitting. In practice this
doesn't matter much, because we don't care about eagerly splitting huge
pages that are shadowing a 32-bit non-PAE guest, so we can just skip huge
pages mapped by shadow pages with has_4_byte_gpte and hard-code the
quadrant to 0.

Plumbing all that shouldn't be too hard. But it occurs to me it might
not be necessary. The quadrant cannot be literally copied from the
parent SP like this commit does, but I think it can still be derived
from the parent. The upside is we don't need any special casing of
has_4_byte_gpte or hard-coding the quadrant in the eager page splitting
code, and we can still get rid of passing in the faulting address to
kvm_mmu_get_page().

Here's what it would (roughly) look like, applied on top of this commit:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6941b9b99a90..4184662b42bf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2110,9 +2110,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
        return sp;
 }

-static union kvm_mmu_page_role kvm_mmu_child_role(struct kvm_mmu_page *parent_sp,
-                                                 bool direct, u32 access)
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct, u32 access)
 {
+       struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
        union kvm_mmu_page_role role;

        role = parent_sp->role;
@@ -2120,6 +2120,28 @@ static union kvm_mmu_page_role kvm_mmu_child_role(struct kvm_mmu_page *parent_sp
        role.access = access;
        role.direct = direct;

+       /*
+        * If the guest has 4-byte PTEs then that means it's using 32-bit,
+        * 2-level, non-PAE paging. KVM shadows such guests using 4 PAE page
+        * directories, each mapping 1/4 of the guest's linear address space
+        * (1GiB). The shadow pages for those 4 page directories are
+        * pre-allocated and assigned a separate quadrant in their role.
+        *
+        * Since we are allocating a child shadow page and there are only 2
+        * levels, this must be a PG_LEVEL_4K shadow page. Here the quadrant
+        * will either be 0 or 1 because it maps 1/2 of the address space mapped
+        * by the guest's PG_LEVEL_4K page table (or 4MiB huge page) that it
+        * is shadowing. In this case, the quadrant can be derived by the index
+        * of the SPTE that points to the new child shadow page in the page
+        * directory (parent_sp). Specifically, every 2 SPTEs in parent_sp
+        * shadow one half of a guest's page table (or 4MiB huge page) so the
+        * quadrant is just the parity of the index of the SPTE.
+        */
+       if (role.has_4_byte_gpte) {
+               BUG_ON(role.level != PG_LEVEL_4K);
+               role.quadrant = (sptep - parent_sp->spt) % 2;
+       }
+
        return role;
 }

@@ -2127,11 +2149,9 @@ static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
                                                 u64 *sptep, gfn_t gfn,
                                                 bool direct, u32 access)
 {
-       struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
        union kvm_mmu_page_role role;

-       role = kvm_mmu_child_role(parent_sp, direct, access);
-
+       role = kvm_mmu_child_role(sptep, direct, access);
        return kvm_mmu_get_page(vcpu, gfn, role);
 }


Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6ca38277f2ab..fc9a4d9c0ddd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2045,30 +2045,14 @@  static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sptep_to_sp(spte));
 }
 
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     int direct,
-					     unsigned int access)
+static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					     union kvm_mmu_page_role role)
 {
-	union kvm_mmu_page_role role;
 	struct hlist_head *sp_list;
-	unsigned quadrant;
 	struct kvm_mmu_page *sp;
 	int collisions = 0;
 	LIST_HEAD(invalid_list);
 
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = level;
-	role.direct = direct;
-	role.access = access;
-	if (role.has_4_byte_gpte) {
-		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
-		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
-		role.quadrant = quadrant;
-	}
-
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -2086,7 +2070,7 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			 * Unsync pages must not be left as is, because the new
 			 * upper-level page will be write-protected.
 			 */
-			if (level > PG_LEVEL_4K && sp->unsync)
+			if (role.level > PG_LEVEL_4K && sp->unsync)
 				kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 							 &invalid_list);
 			continue;
@@ -2125,14 +2109,14 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
+	sp = kvm_mmu_alloc_page(vcpu, role.direct);
 
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
-	if (!direct) {
+	if (!role.direct) {
 		account_shadowed(vcpu->kvm, sp);
-		if (level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
+		if (role.level == PG_LEVEL_4K && kvm_vcpu_write_protect_gfn(vcpu, gfn))
 			kvm_flush_remote_tlbs_with_address(vcpu->kvm, gfn, 1);
 	}
 	trace_kvm_mmu_get_page(sp, true);
@@ -2144,6 +2128,31 @@  static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	return sp;
 }
 
+static union kvm_mmu_page_role kvm_mmu_child_role(struct kvm_mmu_page *parent_sp,
+						  bool direct, u32 access)
+{
+	union kvm_mmu_page_role role;
+
+	role = parent_sp->role;
+	role.level--;
+	role.access = access;
+	role.direct = direct;
+
+	return role;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+						 u64 *sptep, gfn_t gfn,
+						 bool direct, u32 access)
+{
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+	union kvm_mmu_page_role role;
+
+	role = kvm_mmu_child_role(parent_sp, direct, access);
+
+	return kvm_mmu_get_page(vcpu, gfn, role);
+}
+
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
 					struct kvm_vcpu *vcpu, hpa_t root,
 					u64 addr)
@@ -2942,8 +2951,7 @@  static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_shadow_present_pte(*it.sptep))
 			continue;
 
-		sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
-				      it.level - 1, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed &&
@@ -3325,9 +3333,22 @@  static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
 			    u8 level, bool direct)
 {
+	union kvm_mmu_page_role role;
 	struct kvm_mmu_page *sp;
+	unsigned int quadrant;
+
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = level;
+	role.direct = direct;
+	role.access = ACC_ALL;
+
+	if (role.has_4_byte_gpte) {
+		quadrant = gva >> (PAGE_SHIFT + (PT64_PT_BITS * level));
+		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
+		role.quadrant = quadrant;
+	}
 
-	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
+	sp = kvm_mmu_get_page(vcpu, gfn, role);
 	++sp->root_count;
 
 	return __pa(sp->spt);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5b5bdac97c7b..f93d4423a067 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -683,8 +683,9 @@  static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
 			access = gw->pt_access[it.level - 2];
-			sp = kvm_mmu_get_page(vcpu, table_gfn, fault->addr,
-					      it.level-1, false, access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+						  false, access);
+
 			/*
 			 * We must synchronize the pagetable before linking it
 			 * because the guest doesn't need to flush tlb when
@@ -740,8 +741,8 @@  static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		drop_large_spte(vcpu, it.sptep);
 
 		if (!is_shadow_present_pte(*it.sptep)) {
-			sp = kvm_mmu_get_page(vcpu, base_gfn, fault->addr,
-					      it.level - 1, true, direct_access);
+			sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+						  true, direct_access);
 			link_shadow_page(vcpu, it.sptep, sp);
 			if (fault->huge_page_disallowed &&
 			    fault->req_level >= it.level)