
KVM: selftests: Create memslot 0 at GPA 0x100000000 on x86_64

Message ID 20240307194255.1367442-1-dmatlack@google.com (mailing list archive)
State New, archived
Series KVM: selftests: Create memslot 0 at GPA 0x100000000 on x86_64

Commit Message

David Matlack March 7, 2024, 7:42 p.m. UTC
Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
KVM's private memslot for the APIC-access page.

Without this change running large-enough selftests (high number of vCPUs
and/or guest memory) can result in KVM_CREATE_VCPU failing because the
APIC-access page memslot overlaps memslot 0 created by the selftest:

  $ ./dirty_log_perf_test -v 256 -s anonymous_hugetlb_1gb -b 4G
  Test iterations: 2
  Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
  ==== Test Assertion Failure ====
    lib/kvm_util.c:1341: vcpu->fd >= 0
    pid=53654 tid=53654 errno=17 - File exists
    (stack trace empty)
    KVM_CREATE_VCPU failed, rc: -1 errno: 17 (File exists)

Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/include/x86_64/processor.h | 6 ++++++
 tools/testing/selftests/kvm/lib/kvm_util.c             | 7 ++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)


base-commit: 0c64952fec3ea01cb5b09f00134200f3e7ab40d5

Comments

Sean Christopherson March 7, 2024, 10:37 p.m. UTC | #1
On Thu, Mar 07, 2024, David Matlack wrote:
> Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> KVM's private memslot for the APIC-access page.

This is going to cause other problems, e.g. from max_guest_memory_test.c

	/*
	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
	 * the guest's code, stack, and page tables.  Because selftests creates
	 * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
	 * just below the 4gb boundary.  This test could create memory at
	 * 1gb-3gb,but it's simpler to skip straight to 4gb.
	 */
	const uint64_t start_gpa = SZ_4G;

Trying to move away from starting at '0' is going to be problematic/annoying,
e.g. using low memory allows tests to safely assume 4GiB+ is always available.
And I'd prefer not to make the infrastructure all twisty and weird for all tests
just because memstress tests want to play with huge amounts of memory.

Any chance we can solve this by using huge pages in the guest, and adjusting the
gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
of real world workloads (at least, I hope real world workloads are using 2MiB or
1GiB pages in this case).
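
For illustration, a minimal sketch of the kind of huge-page adjustment being
floated here (a hypothetical helper, not the actual vm_nr_pages_required()
code; the name, the 2MiB-only assumption, and the use of DIV_ROUND_UP from
linux/kernel.h are all assumptions):

  /*
   * Hypothetical sketch: estimate last-level page-table pages when guest test
   * memory is mapped with 2MiB pages instead of 4KiB.  One 2MiB mapping covers
   * 512 4KiB pages, so the PTE count, and with it the page-table reservation,
   * shrinks by a factor of 512.
   */
  static uint64_t nr_pte_pages_required(uint64_t nr_guest_4k_pages, bool use_2m_pages)
  {
  	uint64_t nr_ptes = use_2m_pages ? DIV_ROUND_UP(nr_guest_4k_pages, 512)
  					: nr_guest_4k_pages;

  	/* 512 8-byte PTEs fit in one 4KiB page-table page. */
  	return DIV_ROUND_UP(nr_ptes, 512);
  }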
David Matlack March 7, 2024, 11:27 p.m. UTC | #2
On 2024-03-07 02:37 PM, Sean Christopherson wrote:
> On Thu, Mar 07, 2024, David Matlack wrote:
> > Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> > KVM's private memslot for the APIC-access page.
> 
> This is going to cause other problems, e.g. from max_guest_memory_test.c
> 
> 	/*
> 	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
> 	 * the guest's code, stack, and page tables.  Because selftests creates
> 	 * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
> 	 * just below the 4gb boundary.  This test could create memory at
> 	 * 1gb-3gb,but it's simpler to skip straight to 4gb.
> 	 */
> 	const uint64_t start_gpa = SZ_4G;
> 
> Trying to move away from starting at '0' is going to be problematic/annoying,
> e.g. using low memory allows tests to safely assume 4GiB+ is always available.
> And I'd prefer not to make the infrastructure all twisty and weird for all tests
> just because memstress tests want to play with huge amounts of memory.
> 
> Any chance we can solve this by using huge pages in the guest, and adjusting the
> gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
> use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
> of real world workloads (at least, I hope real world workloads are using 2MiB or
> 1GiB pages in this case).

There are real world workloads that use TiB of RAM with 4KiB mappings
(looking at you SAP HANA).

What about giving tests an explicit "start" GPA they can use? That would
fix max_guest_memory_test and avoid tests making assumptions about 4GiB
being a magically safe address to use.

e.g. Something like this on top:

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 194963e05341..584ac6fea65c 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -101,6 +101,11 @@ struct kvm_vm {
 	unsigned int page_shift;
 	unsigned int pa_bits;
 	unsigned int va_bits;
+	/*
+	 * Tests are able to use the guest physical address space from
+	 * [available_base_gpa, max_gfn << page_shift) for their own purposes.
+	 */
+	vm_paddr_t available_base_gpa;
 	uint64_t max_gfn;
 	struct list_head vcpus;
 	struct userspace_mem_regions regions;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index c8d7e66d308d..e74d9efa82c2 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -17,6 +17,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 #include <linux/kernel.h>
+#include <linux/sizes.h>
 
 #define KVM_UTIL_MIN_PFN	2
 
@@ -414,6 +415,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 	uint64_t nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
 						 nr_extra_pages);
 	struct userspace_mem_region *slot0;
+	vm_paddr_t ucall_mmio_gpa;
 	struct kvm_vm *vm;
 	int i;
 
@@ -436,7 +438,15 @@ struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 	 * MMIO region would prevent silently clobbering the MMIO region.
 	 */
 	slot0 = memslot2region(vm, 0);
-	ucall_init(vm, slot0->region.guest_phys_addr + slot0->region.memory_size);
+	ucall_mmio_gpa = slot0->region.guest_phys_addr + slot0->region.memory_size;
+	ucall_init(vm, ucall_mmio_gpa);
+
+	/*
+	 * 1GiB is somewhat arbitrary, but is chosen to be large enough to meet
+	 * most tests' alignment requirements/expectations.
+	 */
+	vm->available_base_gpa =
+		SZ_1G * DIV_ROUND_UP(ucall_mmio_gpa + vm->page_size, SZ_1G);
 
 	kvm_arch_vm_post_create(vm);
 
diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/testing/selftests/kvm/max_guest_memory_test.c
index 6628dc4dda89..f5d77d2a903d 100644
--- a/tools/testing/selftests/kvm/max_guest_memory_test.c
+++ b/tools/testing/selftests/kvm/max_guest_memory_test.c
@@ -156,18 +156,10 @@ static void calc_default_nr_vcpus(void)
 
 int main(int argc, char *argv[])
 {
-	/*
-	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
-	 * the guest's code, stack, and page tables.  Because selftests creates
-	 * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
-	 * just below the 4gb boundary.  This test could create memory at
-	 * 1gb-3gb,but it's simpler to skip straight to 4gb.
-	 */
-	const uint64_t start_gpa = SZ_4G;
 	const int first_slot = 1;
 
 	struct timespec time_start, time_run1, time_reset, time_run2;
-	uint64_t max_gpa, gpa, slot_size, max_mem, i;
+	uint64_t start_gpa, max_gpa, gpa, slot_size, max_mem, i;
 	int max_slots, slot, opt, fd;
 	bool hugepages = false;
 	struct kvm_vcpu **vcpus;
@@ -229,6 +221,7 @@ int main(int argc, char *argv[])
 	for (i = 0; i < slot_size; i += vm->page_size)
 		((uint8_t *)mem)[i] = 0xaa;
 
+	start_gpa = vm->available_base_gpa;
 	gpa = 0;
 	for (slot = first_slot; slot < max_slots; slot++) {
 		gpa = start_gpa + ((slot - first_slot) * slot_size);
David Matlack March 7, 2024, 11:53 p.m. UTC | #3
On Thu, Mar 7, 2024 at 3:27 PM David Matlack <dmatlack@google.com> wrote:
>
> On 2024-03-07 02:37 PM, Sean Christopherson wrote:
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > > Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> > > KVM's private memslot for the APIC-access page.
> >
> > This is going to cause other problems, e.g. from max_guest_memory_test.c
> >
> >       /*
> >        * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
> >        * the guest's code, stack, and page tables.  Because selftests creates
> >        * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
> >        * just below the 4gb boundary.  This test could create memory at
> >        * 1gb-3gb,but it's simpler to skip straight to 4gb.
> >        */
> >       const uint64_t start_gpa = SZ_4G;
> >
> > Trying to move away from starting at '0' is going to be problematic/annoying,
> > e.g. using low memory allows tests to safely assume 4GiB+ is always available.
> > And I'd prefer not to make the infrastructure all twisty and weird for all tests
> > just because memstress tests want to play with huge amounts of memory.
> >
> > Any chance we can solve this by using huge pages in the guest, and adjusting the
> > gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
> > use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
> > of real world workloads (at least, I hope real world workloads are using 2MiB or
> > 1GiB pages in this case).
>
> There are real world workloads that use TiB of RAM with 4KiB mappings
> (looking at you SAP HANA).
>
> What about giving tests an explicit "start" GPA they can use? That would
> fix max_guest_memory_test and avoid tests making assumptions about 4GiB
> being a magically safe address to use.
>
> e.g. Something like this on top:

Gah, I missed nx_huge_page_test.c, which needs similar changes to
max_guest_memory_test.c.

Also if you prefer the "start" address be a compile time constant we
can pick an arbitrary number above 4GiB and use that (e.g. start=32GiB
would be more than enough for a 12TiB guest).
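
As an aside, a minimal sketch of what such a compile-time constant could look
like (the name, the value, and the back-of-the-envelope math are illustrative
assumptions, not part of the patch):

  /*
   * Hypothetical constant: start test-usable GPAs at 32GiB.  Rough math behind
   * "enough for a 12TiB guest": 12TiB / 4KiB = 3 * 2^30 PTEs, at 8 bytes each
   * that is ~24GiB of last-level page tables, which still fits below 32GiB
   * when the library's own memslot starts at 4GiB.
   */
  #define KVM_UTIL_TEST_BASE_GPA	(32ULL << 30)	/* 32GiB */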
Sean Christopherson March 13, 2024, 2:31 p.m. UTC | #4
On Thu, Mar 07, 2024, David Matlack wrote:
> On Thu, Mar 7, 2024 at 3:27 PM David Matlack <dmatlack@google.com> wrote:
> >
> > On 2024-03-07 02:37 PM, Sean Christopherson wrote:
> > > On Thu, Mar 07, 2024, David Matlack wrote:
> > > > Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> > > > KVM's private memslot for the APIC-access page.
> > >
> > > This is going to cause other problems, e.g. from max_guest_memory_test.c
> > >
> > >       /*
> > >        * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
> > >        * the guest's code, stack, and page tables.  Because selftests creates
> > >        * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
> > >        * just below the 4gb boundary.  This test could create memory at
> > >        * 1gb-3gb,but it's simpler to skip straight to 4gb.
> > >        */
> > >       const uint64_t start_gpa = SZ_4G;
> > >
> > > Trying to move away from starting at '0' is going to be problematic/annoying,
> > > e.g. using low memory allows tests to safely assume 4GiB+ is always available.
> > > And I'd prefer not to make the infrastructure all twisty and weird for all tests
> > > just because memstress tests want to play with huge amounts of memory.
> > >
> > > Any chance we can solve this by using huge pages in the guest, and adjusting the
> > > gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
> > > use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
> > > of real world workloads (at least, I hope real world workloads are using 2MiB or
> > > 1GiB pages in this case).
> >
> > There are real world workloads that use TiB of RAM with 4KiB mappings
> > (looking at you SAP HANA).
> >
> > What about giving tests an explicit "start" GPA they can use? That would
> > fix max_guest_memory_test and avoid tests making assumptions about 4GiB
> > being a magically safe address to use.
> >
> > e.g. Something like this on top:
> 
> Gah, I missed nx_huge_page_test.c, which needs similar changes to
> max_guest_memory_test.c.
> 
> Also if you prefer the "start" address be a compile time constant we
> can pick an arbitrary number above 4GiB and use that (e.g. start=32GiB
> would be more than enough for a 12TiB guest).

Aha!  Idea.  I think we can clean up multiple warts at once.

The underlying problem is the memory that is allocated for guest page tables.
The core code allocation is constant for any given test, and a complete non-issue
unless someone writes a *really* crazy test.  And while core data (stack) allocations
scale with the number of vCPUs, they are (a) artificially bounded by the maximum
number of vCPUs and (b) relatively small allocations (a few pages per vCPU).

And for page table allocations, we already have this absurd magic value:

  #define KVM_GUEST_PAGE_TABLE_MIN_PADDR	0x180000

which I'm guessing exists to avoid clobbering code+stack allocations, but that's
an irrelevant tangent.

The other asset we now have is vm->memslots[NR_MEM_REGIONS], and more specifically
allocations for guest page tables are done via vm->memslots[MEM_REGION_PT].

So, rather than more hardcoded addresses and/or a knob to control _all_ code
allocations, I think we should provide a knob to say that MEM_REGION_PT should go
to memory above 4GiB.  And to make memslot handling maintainable in the long term:

  1. Add a knob to place MEM_REGION_PT at 4GiB (and as of this initial patch,
     conditionally in their own memslot).

  2. Use the PT_AT_4GIB (not the real name) knob for the various memstress tests
     that need it.

  3. Formalize memslots 0..2 (CODE, DATA, and PT) as being owned by the library,
     with memslots 3..MAX available for test usage.

  4. Modify tests that assume memslots 1..MAX are available, i.e. force them to
     start at MEM_REGION_TEST_DATA.

  5. Use separate memslots for CODE, DATA, and PT by default.  This will allow
     for more precise sizing of the CODE and DATA slots.

  6. Shrink the number of pages for CODE to a more reasonable number.  Currently
     vm_nr_pages_required() reserves 512 pages / 2MiB for per-VM assets, which
     at a glance seems ridiculously excessive.

  7. Use the PT_AT_4GIB knob in s390's CMMA test?  I suspect it does memslot
     shenanigans purely so that a low gfn (4096 in the test) is guaranteed to
     be available.

For #4, I think it's a fairly easy change.  E.g. set_memory_region_test.c, just
do s/MEM_REGION_SLOT/MEM_REGION_TEST_DATA.

And for max_guest_memory_test.c:

@@ -157,14 +154,14 @@ static void calc_default_nr_vcpus(void)
 int main(int argc, char *argv[])
 {
        /*
-        * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
-        * the guest's code, stack, and page tables.  Because selftests creates
+        * Skip the first 4gb, which are reserved by the core library for the
+        * guest's code, stack, and page tables.  Because selftests creates
         * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
         * just below the 4gb boundary.  This test could create memory at
         * 1gb-3gb,but it's simpler to skip straight to 4gb.
         */
        const uint64_t start_gpa = SZ_4G;
-       const int first_slot = 1;
+       const int first_slot = MEM_REGION_TEST_DATA;
 
        struct timespec time_start, time_run1, time_reset, time_run2;
        uint64_t max_gpa, gpa, slot_size, max_mem, i;
David Matlack March 14, 2024, 9:11 p.m. UTC | #5
On Wed, Mar 13, 2024 at 7:31 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 07, 2024, David Matlack wrote:
> > On Thu, Mar 7, 2024 at 3:27 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > On 2024-03-07 02:37 PM, Sean Christopherson wrote:
> > > > On Thu, Mar 07, 2024, David Matlack wrote:
> > > > > Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> > > > > KVM's private memslot for the APIC-access page.
> > > >
> > > > Any chance we can solve this by using huge pages in the guest, and adjusting the
> > > > gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
> > > > use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
> > > > of real world workloads (at least, I hope real world workloads are using 2MiB or
> > > > 1GiB pages in this case).
> > >
> > > There are real world workloads that use TiB of RAM with 4KiB mappings
> > > (looking at you SAP HANA).
> > >
> > > What about giving tests an explicit "start" GPA they can use? That would
> > > fix max_guest_memory_test and avoid tests making assumptions about 4GiB
> > > being a magically safe address to use.
>
> So, rather than more hardcoded addresses and/or a knob to control _all_ code
> allocations, I think we should provide a knob to say that MEM_REGION_PT should go
> to memory above 4GiB. And to make memslot handling maintainable in the long term:
>
>   1. Add a knob to place MEM_REGION_PT at 4GiB (and as of this initial patch,
>      conditionally in their own memslot).
>
>   2. Use the PT_AT_4GIB (not the real name) knob for the various memstress tests
>      that need it.

Making tests pick when to place page tables at 4GiB seems unnecessary.
Tests that don't otherwise need a specific physical memory layout
should be able to create a VM with any amount of memory and have it
just work.

It's also not impossible that a test has 4GiB+ .bss because the guest
needs a big array for something. In that case we'd need a knob to move
MEM_REGION_CODE above 4GiB on x86_64 as well.

For x86_64 (which is the only architecture AFAIK that has a private
memslot in KVM the framework can overlap with), what's the downside of
always putting all memslots above 4GiB?

>
>   3. Formalize memslots 0..2 (CODE, DATA, and PT) as being owned by the library,
>      with memslots 3..MAX available for test usage.
>
>   4. Modify tests that assume memslots 1..MAX are available, i.e. force them to
>      start at MEM_REGION_TEST_DATA.

I think MEM_REGION_TEST_DATA is just where the framework will satisfy
test-initiated dynamic memory allocations. That's different from which
slots are free for the test to use.

But assuming I understand your intention, I agree in spirit... Tests
should be allowed to use slots TEST_SLOT..MAX and physical addresses
TEST_GPA..MAX. The framework should provide both TEST_SLOT and
TEST_GPA (names pending), and existing tests should use those instead
of random hard-coded values.
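
As a strawman, that could be as simple as a pair of library-provided constants
(the names and values below are illustrative assumptions, nothing here exists
in the library today):

  /* Hypothetical constants the library could export (names pending, per above). */
  #define KVM_UTIL_TEST_SLOT	3		/* first memslot free for test use */
  #define KVM_UTIL_TEST_GPA	(4ULL << 30)	/* first GPA free for test use */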

>
>   5. Use separate memslots for CODE, DATA, and PT by default.  This will allow
>      for more precise sizing of the CODE and DATA slots.

What do you mean by "[separate memslots] will allow for more precise sizing"?

>
>   6. Shrink the number of pages for CODE to a more reasonable number.  Currently
>      vm_nr_pages_required() reserves 512 pages / 2MiB for per-VM assets, which
>      at a glance seems ridiculously excessive.
>
>   7. Use the PT_AT_4GIB knob in s390's CMMA test?  I suspect it does memslot
>      shenanigans purely so that a low gfn (4096 in the test) is guaranteed to
>      be available.

+Nico

Hm, if this test _needs_ to use GFN 4096, then maybe the framework can
give tests two regions 0..KVM_FRAMEWORK_GPA and TEST_GPA..MAX.

If the test just needs any GFN then it can use TEST_GPA instead of
4096 << page_shift.
Sean Christopherson March 14, 2024, 11:14 p.m. UTC | #6
On Thu, Mar 14, 2024, David Matlack wrote:
> On Wed, Mar 13, 2024 at 7:31 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > > On Thu, Mar 7, 2024 at 3:27 PM David Matlack <dmatlack@google.com> wrote:
> > > >
> > > > On 2024-03-07 02:37 PM, Sean Christopherson wrote:
> > > > > On Thu, Mar 07, 2024, David Matlack wrote:
> > > > > > Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> > > > > > KVM's private memslot for the APIC-access page.
> > > > >
> > > > > Any chance we can solve this by using huge pages in the guest, and adjusting the
> > > > > gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
> > > > > use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
> > > > > of real world workloads (at least, I hope real world workloads are using 2MiB or
> > > > > 1GiB pages in this case).
> > > >
> > > > There are real world workloads that use TiB of RAM with 4KiB mappings
> > > > (looking at you SAP HANA).
> > > >
> > > > What about giving tests an explicit "start" GPA they can use? That would
> > > > fix max_guest_memory_test and avoid tests making assumptions about 4GiB
> > > > being a magically safe address to use.
> >
> > So, rather than more hardcoded addresses and/or a knob to control _all_ code
> > allocations, I think we should provide a knob to say that MEM_REGION_PT should go
> > to memory above 4GiB. And to make memslot handling maintainable in the long term:
> >
> >   1. Add a knob to place MEM_REGION_PT at 4GiB (and as of this initial patch,
> >      conditionally in their own memslot).
> >
> >   2. Use the PT_AT_4GIB (not the real name) knob for the various memstress tests
> >      that need it.
> 
> Making tests pick when to place page tables at 4GiB seems unnecessary.
> Tests that don't otherwise need a specific physical memory layout
> should be able to create a VM with any amount of memory and have it
> just work.
> 
> It's also not impossible that a test has 4GiB+ .bss because the guest
> needs a big array for something. In that case we'd need a knob to move
> MEM_REGION_CODE above 4GiB on x86_64 as well.

LOL, at that point, the test can darn well dynamically allocate its memory.
Though I'd love to see a test that needs a 3GiB array :-)

> For x86_64 (which is the only architecture AFAIK that has a private
> memslot in KVM the framework can overlap with), what's the downside of
> always putting all memslots above 4GiB?

Divergence from other architectures, divergence from "real" VM configurations,
and a more complicated programming environment for the vast majority of tests.
E.g. a test that uses more than ~3GiB of memory would need to dynamically place
its test-specific memslots, whereas if the core library keeps everything under
4GiB by default, then on x86 every test knows it has 4GiB+ to play with.

One could argue that dynamically placing test-specific memslots would be more elegant,
but I'd prefer to avoid that because I can't think of any value it would add
from a test coverage perspective, and static addresses are much easier when it
comes to debug.

Having tests use e.g. 2GiB-3GiB or 1GiB-3GiB would kinda sorta work, but that
2GiB limit isn't trivial, e.g. max_guest_memory_test creates TiBs of memslots.

IMO, memstress is the odd one out, it should be the one that needs to do special
things.

> >   3. Formalize memslots 0..2 (CODE, DATA, and PT) as being owned by the library,
> >      with memslots 3..MAX available for test usage.
> >
> >   4. Modify tests that assume memslots 1..MAX are available, i.e. force them to
> >      start at MEM_REGION_TEST_DATA.
> 
> I think MEM_REGION_TEST_DATA is just where the framework will satisfy
> test-initiated dynamic memory allocations. That's different from which
> slots are free for the test to use.
> 
> But assuming I understand your intention, I agree in spirit... Tests
> should be allowed to use slots TEST_SLOT..MAX and physical addresses
> TEST_GPA..MAX. The framework should provide both TEST_SLOT and
> TEST_GPA (names pending), and existing tests should use those instead
> of random hard-coded values.
> 
> >
> >   5. Use separate memslots for CODE, DATA, and PT by default.  This will allow
> >      for more precise sizing of the CODE and DATA slots.
> 
> What do you mean by "[separate memslots] will allow for more precise sizing"?

I suspect there is a _lot_ of slop in the arbitrary 512 pages that are tacked on
by vm_nr_pages_required().  Those 2MiBs probably don't matter, I just don't like
completely magical numbers.

> >   6. Shrink the number of pages for CODE to a more reasonable number.  Currently
> >      vm_nr_pages_required() reserves 512 pages / 2MiB for per-VM assets, which
> >      at a glance seems ridiculously excessive.
> >
> >   7. Use the PT_AT_4GIB knob in s390's CMMA test?  I suspect it does memslot
> >      shenanigans purely so that a low gfn (4096 in the test) is guaranteed to
> >      be available.
> 
> +Nico
> 
> Hm, if this test _needs_ to use GFN 4096, then maybe the framework can
> give tests two regions 0..KVM_FRAMEWORK_GPA and TEST_GPA..MAX.
> 
> If the test just needs any GFN then it can use TEST_GPA instead of
> 4096 << page_shift.
David Matlack March 15, 2024, 4:16 p.m. UTC | #7
On Thu, Mar 14, 2024 at 4:14 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 14, 2024, David Matlack wrote:
> > On Wed, Mar 13, 2024 at 7:31 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Mar 07, 2024, David Matlack wrote:
> > > > On Thu, Mar 7, 2024 at 3:27 PM David Matlack <dmatlack@google.com> wrote:
> > > > >
> > > > > On 2024-03-07 02:37 PM, Sean Christopherson wrote:
> > > > > > On Thu, Mar 07, 2024, David Matlack wrote:
> > > > > > > Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> > > > > > > KVM's private memslot for the APIC-access page.
> > > > > >
> > > > > > Any chance we can solve this by using huge pages in the guest, and adjusting the
> > > > > > gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
> > > > > > use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
> > > > > > of real world workloads (at least, I hope real world workloads are using 2MiB or
> > > > > > 1GiB pages in this case).
> > > > >
> > > > > There are real world workloads that use TiB of RAM with 4KiB mappings
> > > > > (looking at you SAP HANA).
> > > > >
> > > > > What about giving tests an explicit "start" GPA they can use? That would
> > > > > fix max_guest_memory_test and avoid tests making assumptions about 4GiB
> > > > > being a magically safe address to use.
> > >
> > > So, rather than more hardcoded addresses and/or a knob to control _all_ code
> > > allocations, I think we should provide a knob to say that MEM_REGION_PT should go
> > > to memory above 4GiB. And to make memslot handling maintainable in the long term:
> > >
> > >   1. Add a knob to place MEM_REGION_PT at 4GiB (and as of this initial patch,
> > >      conditionally in their own memslot).
> > >
> > >   2. Use the PT_AT_4GIB (not the real name) knob for the various memstress tests
> > >      that need it.
> >
> > Making tests pick when to place page tables at 4GiB seems unnecessary.
> > Tests that don't otherwise need a specific physical memory layout
> > should be able to create a VM with any amount of memory and have it
> > just work.
> >
> > It's also not impossible that a test has 4GiB+ .bss because the guest
> > needs a big array for something. In that case we'd need a knob to move
> > MEM_REGION_CODE above 4GiB on x86_64 as well.
>
> LOL, at that point, the test can darn well dynamically allocate its memory.
> Though I'd love to see a test that needs a 3GiB array :-)
>
> > For x86_64 (which is the only architecture AFAIK that has a private
> > memslot in KVM the framework can overlap with), what's the downside of
> > always putting all memslots above 4GiB?
>
> Divergence from other architectures, divergence from "real" VM configurations,
> and a more complicated programming environment for the vast majority of tests.
> E.g. a test that uses more than ~3GiB of memory would need to dynamically place
> its test-specific memslots, whereas if the core library keeps everything under
> 4GiB by default, then on x86 every test knows it has 4GiB+ to play with.

Divergence from real VM configurations is a really good point. I was
thinking we _could_ make all architectures start at 4GiB and make
32GiB the new static address available for tests to solve the
architecture divergence and complexity problems. But that would mean
all KVM selftests don't use GPAs 0-4GiB, and that seems like a
terrible idea now that you point it out.

Thanks for the thoughtful feedback. I'll give your suggestions a try in v2.

>
> One could argue that dynamically placing test-specific memslots would be more elegant,
> but I'd prefer to avoid that because I can't think of any value it would add
> from a test coverage perspective, and static addresses are much easier when it
> comes to debug.
>
> Having tests use e.g. 2GiB-3GiB or 1GiB-3GiB would kinda sorta work, but that
> 2GiB limit isn't trivial, e.g. max_guest_memory_test creates TiBs of memslots.
>
> IMO, memstress is the odd one out, it should be the one that needs to do special
> things.

Fair point.

>
> > >   3. Formalize memslots 0..2 (CODE, DATA, and PT) as being owned by the library,
> > >      with memslots 3..MAX available for test usage.
> > >
> > >   4. Modify tests that assume memslots 1..MAX are available, i.e. force them to
> > >      start at MEM_REGION_TEST_DATA.
> >
> > I think MEM_REGION_TEST_DATA is just where the framework will satisfy
> > test-initiated dynamic memory allocations. That's different from which
> > slots are free for the test to use.
> >
> > But assuming I understand your intention, I agree in spirit... Tests
> > should be allowed to use slots TEST_SLOT..MAX and physical addresses
> > TEST_GPA..MAX. The framework should provide both TEST_SLOT and
> > TEST_GPA (names pending), and existing tests should use those instead
> > of random hard-coded values.
> >
> > >
> > >   5. Use separate memslots for CODE, DATA, and PT by default.  This will allow
> > >      for more precise sizing of the CODE and DATA slots.
> >
> > What do you mean by "[separate memslots] will allow for more precise sizing"?
>
> I suspect there is a _lot_ of slop in the arbitrary 512 pages that are tacked on
> by vm_nr_pages_required().  Those 2MiBs probably don't matter, I just don't like
> completely magical numbers.

That makes sense, we can probably tighten up those heuristics and
maybe even get rid of the magic numbers. But I wasn't following what
_separate memslots_ has to do with it?
Sean Christopherson March 15, 2024, 5:25 p.m. UTC | #8
On Fri, Mar 15, 2024, David Matlack wrote:
> On Thu, Mar 14, 2024 at 4:14 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >   5. Use separate memslots for CODE, DATA, and PT by default.  This will allow
> > > >      for more precise sizing of the CODE and DATA slots.
> > >
> > > What do you mean by "[separate memslots] will allow for more precise sizing"?
> >
> > I suspect there is a _lot_ of slop in the arbitrary 512 pages that are tacked on
> > by vm_nr_pages_required().  Those 2MiBs probably don't matter, I just don't like
> > completely magical numbers.
> 
> That makes sense, we can probably tighten up those heuristics and
> maybe even get rid of the magic numbers. But I wasn't following what
> _separate memslots_ has to do with it?

Ah, don't dwell on that too much, it was somewhat of a passing thought.

My thinking was that the gorilla math for the page tables makes it hard to determine
how much of those 512 pages is actually for DATA, versus how much is actually a
weird recursive calculation for the page tables themselves.  I.e. it's hard to
trim down @nr_pages, because it might not be easy to tell if DATA shrank too much,
or if it actually caused PT to shrink too much, and so no one touches that mess.
Nico Boehr April 8, 2024, 1:47 p.m. UTC | #9
Quoting David Matlack (2024-03-14 22:11:57)
[...]
> >   7. Use the PT_AT_4GIB knob in s390's CMMA test?  I suspect it does memslot
> >      shenanigans purely so that a low gfn (4096 in the test) is guaranteed to
> >      be available.
> 
> +Nico
> 
> Hm, if this test _needs_ to use GFN 4096, then maybe the framework can
> give tests two regions 0..KVM_FRAMEWORK_GPA and TEST_GPA..MAX.
> 
> If the test just needs any GFN then it can use TEST_GPA instead of
> 4096 << page_shift.

Sorry for the late reply, this got buried at the bottom of my inbox :(

The test creates two memslots with a gap in between to be able to test that
gaps in memslots are correctly skipped by the CMMA-related ioctls.

The test doesn't need GFN 4096. It should not matter where TEST_GPA is, as
long as there's a gap after the main memslot.
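
For illustration, a minimal sketch of the layout the test relies on (the
helper, constants, and slot number below are assumptions for illustration,
not the actual cmma_test code):

  #define MAIN_SLOT_END_GFN	0x1000		/* assumed end of the main memslot */
  #define GAP_GFNS		0x100		/* deliberately unbacked hole */
  #define TEST_SLOT		1
  #define TEST_SLOT_GFN		(MAIN_SLOT_END_GFN + GAP_GFNS)

  static void add_slot_after_gap(struct kvm_vm *vm, uint64_t npages)
  {
  	/* GFNs [MAIN_SLOT_END_GFN, TEST_SLOT_GFN) stay unbacked, forming the gap. */
  	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
  				    TEST_SLOT_GFN * vm->page_size,
  				    TEST_SLOT, npages, 0);
  }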

Feel free to include me on a follow-up, then I can test and look at this.

Patch

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 3bd03b088dda..ee33528007ee 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -20,6 +20,12 @@ 
 
 #include "../kvm_util.h"
 
+/*
+ * Create memslot 0 at 4GiB to avoid overlapping with KVM's private memslot for
+ * the APIC-access page at 0xfee00000.
+ */
+#define KVM_UTIL_MEMSLOT0_GPA 0x100000000ULL
+
 extern bool host_cpu_is_intel;
 extern bool host_cpu_is_amd;
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index b2262b5fad9e..c8d7e66d308d 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -20,6 +20,10 @@ 
 
 #define KVM_UTIL_MIN_PFN	2
 
+#ifndef KVM_UTIL_MEMSLOT0_GPA
+#define KVM_UTIL_MEMSLOT0_GPA 0ULL
+#endif
+
 static int vcpu_mmap_sz(void);
 
 int open_path_or_exit(const char *path, int flags)
@@ -418,7 +422,8 @@  struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 
 	vm = ____vm_create(shape);
 
-	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    KVM_UTIL_MEMSLOT0_GPA, 0, nr_pages, 0);
 	for (i = 0; i < NR_MEM_REGIONS; i++)
 		vm->memslots[i] = 0;