
[v2,10/10] KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2

Message ID 20220517190524.2202762-11-dmatlack@google.com (mailing list archive)
State New, archived
Series KVM: selftests: Add nested support to dirty_log_perf_test

Commit Message

David Matlack May 17, 2022, 7:05 p.m. UTC
Add an option to dirty_log_perf_test that configures the vCPUs to run in
L2 instead of L1. This makes it possible to benchmark the dirty logging
performance of nested virtualization, which is particularly interesting
because KVM must shadow L1's EPT/NPT tables.

For now this is only supported on x86_64 CPUs with VMX. On any other
configuration, passing -n results in the test being skipped.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../selftests/kvm/dirty_log_perf_test.c       | 10 +-
 .../selftests/kvm/include/perf_test_util.h    |  7 ++
 .../selftests/kvm/include/x86_64/vmx.h        |  3 +
 .../selftests/kvm/lib/perf_test_util.c        | 29 +++++-
 .../selftests/kvm/lib/x86_64/perf_test_util.c | 98 +++++++++++++++++++
 tools/testing/selftests/kvm/lib/x86_64/vmx.c  | 13 +++
 7 files changed, 154 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c

Comments

Peter Xu May 17, 2022, 8:20 p.m. UTC | #1
On Tue, May 17, 2022 at 07:05:24PM +0000, David Matlack wrote:
> +uint64_t perf_test_nested_pages(int nr_vcpus)
> +{
> +	/*
> +	 * 513 page tables to identity-map the L2 with 1G pages, plus a few
> +	 * pages per-vCPU for data structures such as the VMCS.
> +	 */
> +	return 513 + 10 * nr_vcpus;

Shouldn't that 513 magic value be related to vm->max_gfn instead (rather
than assuming all hosts have 39 bits PA)?

If my math is correct, it'll require 1GB here just for the l2->l1 pgtables
on a 5-level host to run this test nested. So I had a feeling we'd better
still consider >4 level hosts some day very soon..  No strong opinion, as
long as this test is not run by default.

> +}
Peter Xu May 18, 2022, 1:51 p.m. UTC | #2
On Tue, May 17, 2022 at 04:20:31PM -0400, Peter Xu wrote:
> On Tue, May 17, 2022 at 07:05:24PM +0000, David Matlack wrote:
> > +uint64_t perf_test_nested_pages(int nr_vcpus)
> > +{
> > +	/*
> > +	 * 513 page tables to identity-map the L2 with 1G pages, plus a few
> > +	 * pages per-vCPU for data structures such as the VMCS.
> > +	 */
> > +	return 513 + 10 * nr_vcpus;
> 
> Shouldn't that 513 magic value be related to vm->max_gfn instead (rather
> than assuming all hosts have 39 bits PA)?
> 
> If my math is correct, it'll require 1GB here just for the l2->l1 pgtables
> on a 5-level host to run this test nested. So I had a feeling we'd better
> still consider >4 level hosts some day very soon..  No strong opinion, as
> long as this test is not run by default.

I had a feeling that when I said N level I actually meant N-1 level in all
above, since 39 bits are for 3 level not 4 level?..

Then it's ~512GB pgtables on 5 level?  If so I do think we'd better have a
nicer way to do this identity mapping..

I don't think it's very hard - walk the mem regions in kvm_vm.regions
should work for us?
Sean Christopherson May 18, 2022, 3:24 p.m. UTC | #3
On Wed, May 18, 2022, Peter Xu wrote:
> On Tue, May 17, 2022 at 04:20:31PM -0400, Peter Xu wrote:
> > On Tue, May 17, 2022 at 07:05:24PM +0000, David Matlack wrote:
> > > +uint64_t perf_test_nested_pages(int nr_vcpus)
> > > +{
> > > +	/*
> > > +	 * 513 page tables to identity-map the L2 with 1G pages, plus a few
> > > +	 * pages per-vCPU for data structures such as the VMCS.
> > > +	 */
> > > +	return 513 + 10 * nr_vcpus;
> > 
> > Shouldn't that 513 magic value be related to vm->max_gfn instead (rather
> > than assuming all hosts have 39 bits PA)?
> > 
> > If my math is correct, it'll require 1GB here just for the l2->l1 pgtables
> > on a 5-level host to run this test nested. So I had a feeling we'd better
> > still consider >4 level hosts some day very soon..  No strong opinion, as
> > long as this test is not run by default.
> 
> I had a feeling that when I said N level I actually meant N-1 level in all
> above, since 39 bits are for 3 level not 4 level?..
> 
> Then it's ~512GB pgtables on 5 level?  If so I do think we'd better have a
> nicer way to do this identity mapping..

Agreed, mapping all theoretically possible gfns into L2 is doomed to fail for
larger MAXPHYADDR systems.

Page table allocations are currently hardcoded to come from memslot0.  memslot0
is required to be in lower DRAM, and thus tops out at ~3gb for all intents and
purposes because we need to leave room for the xAPIC.

And I would strongly prefer not to plumb back the ability to specify an alternative
memslot for page table allocations, because except for truly pathological tests that
functionality is unnecessary and pointless complexity.

> I don't think it's very hard - walk the mem regions in kvm_vm.regions
> should work for us?

Yeah.  Alternatively, the test can identity map all of memory <4gb and then also
map "guest_test_phys_mem - guest_num_pages".  I don't think there's any other memory
to deal with, is there?
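
To get a feel for how small the alternative mapping is, the sketch below counts the 1GiB EPT leaf entries it would need. It reads the quoted expression as the range that starts at guest_test_phys_mem and spans guest_num_pages pages, which is an interpretation rather than something stated in the thread. The struct and both helper names are hypothetical and not part of the selftests.

```c
#include <stdint.h>

#define SZ_1G	(1ULL << 30)
#define SZ_4G	(4 * SZ_1G)

/* Half-open guest physical range [start, end). Hypothetical type. */
struct gpa_range {
	uint64_t start;
	uint64_t end;
};

/* Expand [start, start + size) outward to 1GiB boundaries. */
struct gpa_range align_range_1g(uint64_t start, uint64_t size)
{
	struct gpa_range r;

	r.start = start & ~(SZ_1G - 1);
	r.end = (start + size + SZ_1G - 1) & ~(SZ_1G - 1);
	return r;
}

/*
 * Count the 1GiB leaf entries needed to identity-map everything below
 * 4GiB plus the test memory region, without double counting any part of
 * the test region that already lies below 4GiB.
 */
uint64_t nr_1g_mappings(uint64_t test_start, uint64_t test_size)
{
	struct gpa_range r = align_range_1g(test_start, test_size);
	uint64_t n = SZ_4G / SZ_1G;

	if (r.start < SZ_4G)
		r.start = SZ_4G;
	if (r.end > r.start)
		n += (r.end - r.start) / SZ_1G;

	return n;
}
```

Under these assumptions the number of leaf entries scales with the size of the test memory rather than with MAXPHYADDR.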
David Matlack May 18, 2022, 4:12 p.m. UTC | #4
On Wed, May 18, 2022 at 8:24 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 18, 2022, Peter Xu wrote:
> > On Tue, May 17, 2022 at 04:20:31PM -0400, Peter Xu wrote:
> > > On Tue, May 17, 2022 at 07:05:24PM +0000, David Matlack wrote:
> > > > +uint64_t perf_test_nested_pages(int nr_vcpus)
> > > > +{
> > > > + /*
> > > > +  * 513 page tables to identity-map the L2 with 1G pages, plus a few
> > > > +  * pages per-vCPU for data structures such as the VMCS.
> > > > +  */
> > > > + return 513 + 10 * nr_vcpus;
> > >
> > > Shouldn't that 513 magic value be related to vm->max_gfn instead (rather
> > > than assuming all hosts have 39 bits PA)?
> > >
> > > If my math is correct, it'll require 1GB here just for the l2->l1 pgtables
> > > on a 5-level host to run this test nested. So I had a feeling we'd better
> > > still consider >4 level hosts some day very soon..  No strong opinion, as
> > > long as this test is not run by default.
> >
> > I had a feeling that when I said N level I actually meant N-1 level in all
> > above, since 39 bits are for 3 level not 4 level?..
> >
> > Then it's ~512GB pgtables on 5 level?  If so I do think we'd better have a
> > nicer way to do this identity mapping..
>
> Agreed, mapping all theoretically possible gfns into L2 is doomed to fail for
> larger MAXPHYADDR systems.

Peter, I think your original math was correct. For 4-level we need 1
L4 + 512 L3 tables (i.e. ~2MiB) to map the entire address space. Each
of the L3 tables contains 512 PTEs that each point to a 1GiB page,
mapping in total 512 * 512 GiB = 256 TiB.

So for 5-level we need 1 L5 + 512 L4 + 262144 L3 tables (i.e. ~1GiB).
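
As a rough check of this arithmetic, the sketch below counts the 4KiB page-table pages needed to identity-map an address space with 1GiB leaf entries. It assumes x86-64 style tables with 512 entries per level and a hierarchy just deep enough to cover the given number of address bits. The helper name tables_for_1g_identity_map() is hypothetical and not part of the selftests.

```c
#include <stdint.h>

#define ENTRY_BITS	9	/* 512 entries per 4KiB table */
#define LEAF_SHIFT_1G	30	/* each leaf entry maps 1GiB */

/*
 * Return the number of page-table pages needed to map 2^addr_bits bytes
 * of address space using 1GiB leaf entries.
 */
uint64_t tables_for_1g_identity_map(unsigned int addr_bits)
{
	/* Address bits covered by a single table at the current level. */
	unsigned int covered = LEAF_SHIFT_1G + ENTRY_BITS;
	uint64_t tables = 0;

	/* Walk upward from the tables that hold the 1GiB leaf entries. */
	while (covered < addr_bits) {
		tables += 1ULL << (addr_bits - covered);
		covered += ENTRY_BITS;
	}

	/* A single root table covers whatever remains. */
	return tables + 1;
}
```

Under these assumptions the helper returns 513 for a 48-bit space (about 2MiB of tables) and 262657 for a 57-bit space (about 1GiB of tables).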

>
> Page table allocations are currently hardcoded to come from memslot0.  memslot0
> is required to be in lower DRAM, and thus tops out at ~3gb for all intents and
> purposes because we need to leave room for the xAPIC.
>
> And I would strongly prefer not to plumb back the ability to specificy an alternative
> memslot for page table allocations, because except for truly pathological tests that
> functionality is unnecessary and pointless complexity.
>
> > I don't think it's very hard - walk the mem regions in kvm_vm.regions
> > should work for us?
>
> Yeah.  Alternatively, The test can identity map all of memory <4gb and then also
> map "guest_test_phys_mem - guest_num_pages".  I don't think there's any other memory
> to deal with, is there?

This isn't necessary for 4-level, but it also wouldn't be too hard to
implement. I can take a stab at implementing it in v3 if we think 5-level
selftests are coming soon.
Sean Christopherson May 18, 2022, 4:37 p.m. UTC | #5
On Wed, May 18, 2022, David Matlack wrote:
> On Wed, May 18, 2022 at 8:24 AM Sean Christopherson <seanjc@google.com> wrote:
> > Page table allocations are currently hardcoded to come from memslot0.  memslot0
> > is required to be in lower DRAM, and thus tops out at ~3gb for all intents and
> > purposes because we need to leave room for the xAPIC.
> >
> > And I would strongly prefer not to plumb back the ability to specificy an alternative
> > memslot for page table allocations, because except for truly pathological tests that
> > functionality is unnecessary and pointless complexity.
> >
> > > I don't think it's very hard - walk the mem regions in kvm_vm.regions
> > > should work for us?
> >
> > Yeah.  Alternatively, The test can identity map all of memory <4gb and then also
> > map "guest_test_phys_mem - guest_num_pages".  I don't think there's any other memory
> > to deal with, is there?
> 
> This isn't necessary for 4-level, but also wouldn't be too hard to
> implement. I can take a stab at implementing in v3 if we think 5-level
> selftests are coming soon.

The current incarnation of nested_map_all_1g() is broken irrespective of 5-level
paging.  If MAXPHYADDR > 48, then bits 51:48 will either be ignored or will cause
reserved #PF or #GP[*].  Because the test puts memory at max_gfn, identity mapping
test memory will fail if 4-level paging is used and MAXPHYADDR > 48.

I think the easiest thing would be to restrict the "starting" upper gfn to the min
of max_gfn and the max addressable gfn based on whether 4-level or 5-level paging
is in use.
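
The clamp suggested here could look roughly like the sketch below. It assumes 4KiB base pages and that 4-level and 5-level paging can address 48 and 57 bits respectively. Both helper names are hypothetical, and the sketch is not the code that went into any later revision.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_4K	12

/* Highest gfn addressable by page tables of the given depth. */
uint64_t max_addressable_gfn(bool five_level)
{
	unsigned int addr_bits = five_level ? 57 : 48;

	return (1ULL << (addr_bits - PAGE_SHIFT_4K)) - 1;
}

/* Restrict the upper gfn used for test memory to what paging can reach. */
uint64_t clamp_test_max_gfn(uint64_t max_gfn, bool five_level)
{
	uint64_t limit = max_addressable_gfn(five_level);

	return max_gfn < limit ? max_gfn : limit;
}
```

With a 52-bit MAXPHYADDR and 4-level paging, the sketch lowers the upper gfn from 2^40 - 1 to 2^36 - 1. On hosts with MAXPHYADDR <= 48 it leaves max_gfn unchanged.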

[*] Intel's SDM is comically out-of-date and pretends 5-level EPT doesn't exist,
    so I'm not sure what happens if a GPA is greater than the PWL.

    Section "28.3.2 EPT Translation Mechanism" still says:

    The EPT translation mechanism uses only bits 47:0 of each guest-physical address.

    No processors supporting the Intel 64 architecture support more than 48
    physical-address bits. Thus, no such processor can produce a guest-physical
    address with more than 48 bits. An attempt to use such an address causes a
    page fault. An attempt to load CR3 with such an address causes a general-protection
    fault. If PAE paging is being used, an attempt to load CR3 that would load a
    PDPTE with such an address causes a general-protection fault.
David Matlack May 20, 2022, 10:01 p.m. UTC | #6
On Wed, May 18, 2022 at 9:37 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 18, 2022, David Matlack wrote:
> > On Wed, May 18, 2022 at 8:24 AM Sean Christopherson <seanjc@google.com> wrote:
> > > Page table allocations are currently hardcoded to come from memslot0.  memslot0
> > > is required to be in lower DRAM, and thus tops out at ~3gb for all intents and
> > > purposes because we need to leave room for the xAPIC.
> > >
> > > And I would strongly prefer not to plumb back the ability to specificy an alternative
> > > memslot for page table allocations, because except for truly pathological tests that
> > > functionality is unnecessary and pointless complexity.
> > >
> > > > I don't think it's very hard - walk the mem regions in kvm_vm.regions
> > > > should work for us?
> > >
> > > Yeah.  Alternatively, The test can identity map all of memory <4gb and then also
> > > map "guest_test_phys_mem - guest_num_pages".  I don't think there's any other memory
> > > to deal with, is there?
> >
> > This isn't necessary for 4-level, but also wouldn't be too hard to
> > implement. I can take a stab at implementing in v3 if we think 5-level
> > selftests are coming soon.
>
> The current incarnation of nested_map_all_1g() is broken irrespective of 5-level
> paging.  If MAXPHYADDR > 48, then bits 51:48 will either be ignored or will cause
> reserved #PF or #GP[*].  Because the test puts memory at max_gfn, identity mapping
> test memory will fail if 4-level paging is used and MAXPHYADDR > 48.

Ah good point.

I wasn't able to get a machine with MAXPHYADDR > 48 to test today, so
I've just made __nested_pg_map() assert that the nested_paddr fits in
48 bits. We can add support for 5-level paging, or implement your idea
to restrict the perf_test_util gfn to 48 bits, in a subsequent series
when it becomes necessary.
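
The check described here boils down to a single shift. Below is a minimal sketch in plain C with a hypothetical helper name. In the selftests the condition would presumably be wrapped in the framework's assertion macro inside __nested_pg_map(), and the actual code in later revisions may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Address bits reachable through 4-level nested page tables. */
#define NESTED_4_LEVEL_ADDR_BITS	48

/* Return true if the nested physical address has no bits set above bit 47. */
bool nested_paddr_fits_4_level(uint64_t nested_paddr)
{
	return (nested_paddr >> NESTED_4_LEVEL_ADDR_BITS) == 0;
}
```

Failing loudly on an out-of-range address is preferable to silently dropping the upper bits, which would map the wrong guest physical address.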

>
> I think the easist thing would be to restrict the "starting" upper gfn to the min
> of max_gfn and the max addressable gfn based on whether 4-level or 5-level paging
> is in use.
>
> [*] Intel's SDM is comically out-of-date and pretends 5-level EPT doesn't exist,
>     so I'm not sure what happens if a GPA is greater than the PWL.
>
>     Section "28.3.2 EPT Translation Mechanism" still says:
>
>     The EPT translation mechanism uses only bits 47:0 of each guest-physical address.
>
>     No processors supporting the Intel 64 architecture support more than 48
>     physical-address bits. Thus, no such processor can produce a guest-physical
>     address with more than 48 bits. An attempt to use such an address causes a
>     page fault. An attempt to load CR3 with such an address causes a general-protection
>     fault. If PAE paging is being used, an attempt to load CR3 that would load a
>     PDPTE with such an address causes a general-protection fault.
David Matlack May 20, 2022, 10:49 p.m. UTC | #7
On Fri, May 20, 2022 at 3:01 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, May 18, 2022 at 9:37 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, May 18, 2022, David Matlack wrote:
> > > On Wed, May 18, 2022 at 8:24 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > Page table allocations are currently hardcoded to come from memslot0.  memslot0
> > > > is required to be in lower DRAM, and thus tops out at ~3gb for all intents and
> > > > purposes because we need to leave room for the xAPIC.
> > > >
> > > > And I would strongly prefer not to plumb back the ability to specificy an alternative
> > > > memslot for page table allocations, because except for truly pathological tests that
> > > > functionality is unnecessary and pointless complexity.
> > > >
> > > > > I don't think it's very hard - walk the mem regions in kvm_vm.regions
> > > > > should work for us?
> > > >
> > > > Yeah.  Alternatively, The test can identity map all of memory <4gb and then also
> > > > map "guest_test_phys_mem - guest_num_pages".  I don't think there's any other memory
> > > > to deal with, is there?
> > >
> > > This isn't necessary for 4-level, but also wouldn't be too hard to
> > > implement. I can take a stab at implementing in v3 if we think 5-level
> > > selftests are coming soon.
> >
> > The current incarnation of nested_map_all_1g() is broken irrespective of 5-level
> > paging.  If MAXPHYADDR > 48, then bits 51:48 will either be ignored or will cause
> > reserved #PF or #GP[*].  Because the test puts memory at max_gfn, identity mapping
> > test memory will fail if 4-level paging is used and MAXPHYADDR > 48.
>
> Ah good point.
>
> I wasn't able to get a machine with MAXPHYADDR > 48 to test today so
> I've just made __nested_pg_map() assert that the nested_paddr fits in
> 48 bits. We can add the support for 5-level paging or your idea to
> restrict the perf_test_util gfn to 48-bits in a subsequent series when
> it becomes necessary.

Never mind, I've got a machine to test on now. I'll have a v4 out in a
few minutes to address MAXPHYADDR > 48 hosts. In the meantime I've
confirmed that the new assert in __nested_pg_map() works as expected
:)

>
> >
> > I think the easist thing would be to restrict the "starting" upper gfn to the min
> > of max_gfn and the max addressable gfn based on whether 4-level or 5-level paging
> > is in use.
> >
> > [*] Intel's SDM is comically out-of-date and pretends 5-level EPT doesn't exist,
> >     so I'm not sure what happens if a GPA is greater than the PWL.
> >
> >     Section "28.3.2 EPT Translation Mechanism" still says:
> >
> >     The EPT translation mechanism uses only bits 47:0 of each guest-physical address.
> >
> >     No processors supporting the Intel 64 architecture support more than 48
> >     physical-address bits. Thus, no such processor can produce a guest-physical
> >     address with more than 48 bits. An attempt to use such an address causes a
> >     page fault. An attempt to load CR3 with such an address causes a general-protection
> >     fault. If PAE paging is being used, an attempt to load CR3 that would load a
> >     PDPTE with such an address causes a general-protection fault.

Patch

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 83b9ffa456ea..42cb904f6e54 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -49,6 +49,7 @@  LIBKVM += lib/test_util.c
 
 LIBKVM_x86_64 += lib/x86_64/apic.c
 LIBKVM_x86_64 += lib/x86_64/handlers.S
+LIBKVM_x86_64 += lib/x86_64/perf_test_util.c
 LIBKVM_x86_64 += lib/x86_64/processor.c
 LIBKVM_x86_64 += lib/x86_64/svm.c
 LIBKVM_x86_64 += lib/x86_64/ucall.c
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 7b47ae4f952e..d60a34cdfaee 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -336,8 +336,8 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-i iterations] [-p offset] [-g]"
-	       "[-m mode] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]"
+	printf("usage: %s [-h] [-i iterations] [-p offset] [-g] "
+	       "[-m mode] [-n] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type] "
 	       "[-x memslots]\n", name);
 	puts("");
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
@@ -351,6 +351,7 @@  static void help(char *name)
 	printf(" -p: specify guest physical test memory offset\n"
 	       "     Warning: a low offset can conflict with the loaded test code.\n");
 	guest_modes_help();
+	printf(" -n: Run the vCPUs in nested mode (L2)\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     dirtied by each vCPU. e.g. 10M or 3G.\n"
 	       "     (default: 1G)\n");
@@ -387,7 +388,7 @@  int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ghi:p:m:b:f:v:os:x:")) != -1) {
+	while ((opt = getopt(argc, argv, "ghi:p:m:nb:f:v:os:x:")) != -1) {
 		switch (opt) {
 		case 'g':
 			dirty_log_manual_caps = 0;
@@ -401,6 +402,9 @@  int main(int argc, char *argv[])
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
+		case 'n':
+			perf_test_args.nested = true;
+			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
 			break;
diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
index a86f953d8d36..b6c1770ab831 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -34,6 +34,9 @@  struct perf_test_args {
 	uint64_t guest_page_size;
 	int wr_fract;
 
+	/* Run vCPUs in L2 instead of L1, if the architecture supports it. */
+	bool nested;
+
 	struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
 };
 
@@ -49,5 +52,9 @@  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
 
 void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
 void perf_test_join_vcpu_threads(int vcpus);
+void perf_test_guest_code(uint32_t vcpu_id);
+
+uint64_t perf_test_nested_pages(int nr_vcpus);
+void perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus);
 
 #endif /* SELFTEST_KVM_PERF_TEST_UTIL_H */
diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h
index 3b1794baa97c..17d712503a36 100644
--- a/tools/testing/selftests/kvm/include/x86_64/vmx.h
+++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h
@@ -96,6 +96,7 @@ 
 #define VMX_MISC_PREEMPTION_TIMER_RATE_MASK	0x0000001f
 #define VMX_MISC_SAVE_EFER_LMA			0x00000020
 
+#define VMX_EPT_VPID_CAP_1G_PAGES		0x00020000
 #define VMX_EPT_VPID_CAP_AD_BITS		0x00200000
 
 #define EXIT_REASON_FAILED_VMENTRY	0x80000000
@@ -608,6 +609,7 @@  bool load_vmcs(struct vmx_pages *vmx);
 
 bool nested_vmx_supported(void);
 void nested_vmx_check_supported(void);
+bool ept_1g_pages_supported(void);
 
 void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm,
 		   uint64_t nested_paddr, uint64_t paddr);
@@ -615,6 +617,7 @@  void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm,
 		 uint64_t nested_paddr, uint64_t paddr, uint64_t size);
 void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm,
 			uint32_t memslot);
+void nested_map_all_1g(struct vmx_pages *vmx, struct kvm_vm *vm);
 void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm,
 		  uint32_t eptp_memslot);
 void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index 722df3a28791..530be01706d5 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -40,7 +40,7 @@  static bool all_vcpu_threads_running;
  * Continuously write to the first 8 bytes of each page in the
  * specified region.
  */
-static void guest_code(uint32_t vcpu_id)
+void perf_test_guest_code(uint32_t vcpu_id)
 {
 	struct perf_test_args *pta = &perf_test_args;
 	struct perf_test_vcpu_args *vcpu_args = &pta->vcpu_args[vcpu_id];
@@ -108,7 +108,7 @@  struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
 {
 	struct perf_test_args *pta = &perf_test_args;
 	struct kvm_vm *vm;
-	uint64_t guest_num_pages;
+	uint64_t guest_num_pages, slot0_pages = DEFAULT_GUEST_PHY_PAGES;
 	uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src);
 	int i;
 
@@ -134,13 +134,20 @@  struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
 		    "Guest memory cannot be evenly divided into %d slots.",
 		    slots);
 
+	/*
+	 * If using nested, allocate extra pages for the nested page tables and
+	 * in-memory data structures.
+	 */
+	if (pta->nested)
+		slot0_pages += perf_test_nested_pages(vcpus);
+
 	/*
 	 * Pass guest_num_pages to populate the page tables for test memory.
 	 * The memory is also added to memslot 0, but that's a benign side
 	 * effect as KVM allows aliasing HVAs in meslots.
 	 */
-	vm = vm_create_with_vcpus(mode, vcpus, DEFAULT_GUEST_PHY_PAGES,
-				  guest_num_pages, 0, guest_code, NULL);
+	vm = vm_create_with_vcpus(mode, vcpus, slot0_pages, guest_num_pages, 0,
+				  perf_test_guest_code, NULL);
 
 	pta->vm = vm;
 
@@ -178,6 +185,9 @@  struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
 
 	perf_test_setup_vcpus(vm, vcpus, vcpu_memory_bytes, partition_vcpu_memory_access);
 
+	if (pta->nested)
+		perf_test_setup_nested(vm, vcpus);
+
 	ucall_init(vm, NULL);
 
 	/* Export the shared variables to the guest. */
@@ -198,6 +208,17 @@  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
 	sync_global_to_guest(vm, perf_test_args);
 }
 
+uint64_t __weak perf_test_nested_pages(int nr_vcpus)
+{
+	return 0;
+}
+
+void __weak perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus)
+{
+	pr_info("%s() not supported on this architecture, skipping.\n", __func__);
+	exit(KSFT_SKIP);
+}
+
 static void *vcpu_thread_main(void *data)
 {
 	struct vcpu_thread *vcpu = data;
diff --git a/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c b/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c
new file mode 100644
index 000000000000..472e7d5a182b
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c
@@ -0,0 +1,98 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * x86_64-specific extensions to perf_test_util.c.
+ *
+ * Copyright (C) 2022, Google, Inc.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "perf_test_util.h"
+#include "../kvm_util_internal.h"
+#include "processor.h"
+#include "vmx.h"
+
+void perf_test_l2_guest_code(uint64_t vcpu_id)
+{
+	perf_test_guest_code(vcpu_id);
+	vmcall();
+}
+
+extern char perf_test_l2_guest_entry[];
+__asm__(
+"perf_test_l2_guest_entry:"
+"	mov (%rsp), %rdi;"
+"	call perf_test_l2_guest_code;"
+"	ud2;"
+);
+
+static void perf_test_l1_guest_code(struct vmx_pages *vmx, uint64_t vcpu_id)
+{
+#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	unsigned long *rsp;
+
+	GUEST_ASSERT(vmx->vmcs_gpa);
+	GUEST_ASSERT(prepare_for_vmx_operation(vmx));
+	GUEST_ASSERT(load_vmcs(vmx));
+	GUEST_ASSERT(ept_1g_pages_supported());
+
+	rsp = &l2_guest_stack[L2_GUEST_STACK_SIZE - 1];
+	*rsp = vcpu_id;
+	prepare_vmcs(vmx, perf_test_l2_guest_entry, rsp);
+
+	GUEST_ASSERT(!vmlaunch());
+	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+	GUEST_DONE();
+}
+
+uint64_t perf_test_nested_pages(int nr_vcpus)
+{
+	/*
+	 * 513 page tables to identity-map the L2 with 1G pages, plus a few
+	 * pages per-vCPU for data structures such as the VMCS.
+	 */
+	return 513 + 10 * nr_vcpus;
+}
+
+void perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus)
+{
+	struct vmx_pages *vmx, *vmx0 = NULL;
+	struct kvm_regs regs;
+	vm_vaddr_t vmx_gva;
+	int vcpu_id;
+
+	nested_vmx_check_supported();
+
+	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
+		vmx = vcpu_alloc_vmx(vm, &vmx_gva);
+
+		if (vcpu_id == 0) {
+			prepare_eptp(vmx, vm, 0);
+			/*
+			 * Identity map L2 with 1G pages so that KVM can shadow
+			 * the EPT12 with huge pages.
+			 */
+			nested_map_all_1g(vmx, vm);
+			vmx0 = vmx;
+		} else {
+			/* Share the same EPT table across all vCPUs. */
+			vmx->eptp = vmx0->eptp;
+			vmx->eptp_hva = vmx0->eptp_hva;
+			vmx->eptp_gpa = vmx0->eptp_gpa;
+		}
+
+		/*
+		 * Override the vCPU to run perf_test_l1_guest_code() which will
+		 * bounce it into L2 before calling perf_test_guest_code().
+		 */
+		vcpu_regs_get(vm, vcpu_id, &regs);
+		regs.rip = (unsigned long) perf_test_l1_guest_code;
+		vcpu_regs_set(vm, vcpu_id, &regs);
+		vcpu_args_set(vm, vcpu_id, 2, vmx_gva, vcpu_id);
+	}
+}
diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
index 5bf169179455..9858e56370cb 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
@@ -203,6 +203,11 @@  static bool ept_vpid_cap_supported(uint64_t mask)
 	return rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & mask;
 }
 
+bool ept_1g_pages_supported(void)
+{
+	return ept_vpid_cap_supported(VMX_EPT_VPID_CAP_1G_PAGES);
+}
+
 /*
  * Initialize the control fields to the most basic settings possible.
  */
@@ -547,6 +552,14 @@  void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm,
 	}
 }
 
+/* Identity map the entire guest physical address space with 1GiB pages. */
+void nested_map_all_1g(struct vmx_pages *vmx, struct kvm_vm *vm)
+{
+	uint64_t gpa_size = (vm->max_gfn + 1) << vm->page_shift;
+
+	__nested_map(vmx, vm, 0, 0, gpa_size, PG_LEVEL_1G);
+}
+
 void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm,
 		  uint32_t eptp_memslot)
 {