KVM: selftests: Add a requirement for disabling numa balancing

Message ID: 20240117064441.2633784-1-tao1.su@linux.intel.com
State New, archived
Series: KVM: selftests: Add a requirement for disabling numa balancing

Commit Message

Tao Su Jan. 17, 2024, 6:44 a.m. UTC
In dirty_log_page_splitting_test, vm_get_stat(vm, "pages_4k") can
gradually drop to 0 after VM-exit. The reason is that the host triggers
NUMA balancing, which unmaps the related SPTEs. As a result, the number
of pages currently mapped in the EPT (kvm->stat.pages) no longer matches
the number of pages touched by the guest, so stats_populated.pages_4k
and stats_repopulated.pages_4k in this test differ and the test fails.

The call trace of the SPTE unmap triggered by NUMA balancing:
	handle_changed_spte+0x64b/0x830 [kvm]
	tdp_mmu_zap_leafs+0x159/0x290 [kvm]
	kvm_tdp_mmu_unmap_gfn_range+0x7b/0xa0 [kvm]
	kvm_unmap_gfn_range+0x10f/0x130 [kvm]
	? _raw_spin_unlock+0x1d/0x40
	? hugetlb_follow_page_mask+0x1ba/0x400
	? preempt_count_add+0x86/0xd0
	kvm_mmu_notifier_invalidate_range_start+0x14d/0x380 [kvm]
	__mmu_notifier_invalidate_range_start+0x89/0x1f0
	change_protection+0xce1/0x1490
	? __pfx_tlb_is_not_lazy+0x10/0x10
	change_prot_numa+0x5d/0xb0
	? kmalloc_trace+0x2e/0xa0
	task_numa_work+0x364/0x550
	task_work_run+0x62/0xa0
	xfer_to_guest_mode_handle_work+0xc3/0xd0
	kvm_arch_vcpu_ioctl_run+0xe8e/0x1b90 [kvm]
	kvm_vcpu_ioctl+0x282/0x710 [kvm]

dirty_log_page_splitting_test assumes that kvm->stat.pages matches the
number of pages touched by the guest, but that assumption no longer
holds when NUMA balancing is enabled. Add a requirement that
numa_balancing be disabled to avoid confusing test failures.

Actually, any page migration (including NUMA balancing) will trigger
this issue, e.g. when running this script:
	./x86_64/dirty_log_page_splitting_test &
	PID=$!
	sleep 1
	migratepages $PID 0 1
It is unusual to create the above environment intentionally, but NUMA
balancing initiated by the kernel is very likely to be triggered, at
least in dirty_log_page_splitting_test.

Reported-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
---
 .../kvm/x86_64/dirty_log_page_splitting_test.c        | 11 +++++++++++
 1 file changed, 11 insertions(+)


base-commit: 052d534373b7ed33712a63d5e17b2b6cdbce84fd

Comments

Sean Christopherson Jan. 17, 2024, 3:12 p.m. UTC | #1
On Wed, Jan 17, 2024, Tao Su wrote:
> In dirty_log_page_splitting_test, vm_get_stat(vm, "pages_4k") can
> gradually drop to 0 after VM-exit. The reason is that the host triggers
> NUMA balancing, which unmaps the related SPTEs. As a result, the number
> of pages currently mapped in the EPT (kvm->stat.pages) no longer matches
> the number of pages touched by the guest, so stats_populated.pages_4k
> and stats_repopulated.pages_4k in this test differ and the test fails.

...

> dirty_log_page_splitting_test assumes that kvm->stat.pages matches the
> number of pages touched by the guest, but that assumption no longer
> holds when NUMA balancing is enabled. Add a requirement that
> numa_balancing be disabled to avoid confusing test failures.
> 
> Actually, any page migration (including NUMA balancing) will trigger
> this issue, e.g. when running this script:
> 	./x86_64/dirty_log_page_splitting_test &
> 	PID=$!
> 	sleep 1
> 	migratepages $PID 0 1
> It is unusual to create the above environment intentionally, but NUMA
> balancing initiated by the kernel is very likely to be triggered, at
> least in dirty_log_page_splitting_test.
> 
> Reported-by: Yi Lai <yi1.lai@intel.com>
> Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> Tested-by: Yi Lai <yi1.lai@intel.com>
> ---
>  .../kvm/x86_64/dirty_log_page_splitting_test.c        | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> index 634c6bfcd572..f2c796111d83 100644
> --- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> @@ -212,10 +212,21 @@ static void help(char *name)
>  
>  int main(int argc, char *argv[])
>  {
> +	FILE *f;
>  	int opt;
> +	int ret, numa_balancing;
>  
>  	TEST_REQUIRE(get_kvm_param_bool("eager_page_split"));
>  	TEST_REQUIRE(get_kvm_param_bool("tdp_mmu"));
> +	f = fopen("/proc/sys/kernel/numa_balancing", "r");
> +	if (f) {
> +		ret = fscanf(f, "%d", &numa_balancing);
> +		TEST_ASSERT(ret == 1, "Error reading numa_balancing");
> +		TEST_ASSERT(!numa_balancing, "please run "
> +			    "'echo 0 > /proc/sys/kernel/numa_balancing'");

If we go this route, this should be a TEST_REQUIRE(), not a TEST_ASSERT().  The
test hasn't failed, rather it has detected an incompatible setup.
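
E.g. something like this (completely untested, and the helper name is
just a suggestion):

	static bool is_numa_balancing_enabled(void)
	{
		FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
		int numa_balancing = 0;

		/* If the sysctl doesn't exist, NUMA balancing is off. */
		if (!f)
			return false;

		if (fscanf(f, "%d", &numa_balancing) != 1)
			numa_balancing = 0;
		fclose(f);
		return numa_balancing != 0;
	}

	...

	TEST_REQUIRE(!is_numa_balancing_enabled());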

Something isn't right though.  The test defaults to HugeTLB, and the invocation
in the changelog doesn't override the backing source.  That suggests that NUMA
auto-balancing is zapping HugeTLB VMAs, which AFAIK shouldn't happen, e.g. this
code in task_numa_work() should cause such VMAs to be skipped:

		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
			continue;
		}

And the test already warns the user if they opt to use something other than
HugeTLB.

	if (!is_backing_src_hugetlb(backing_src)) {
		pr_info("This test will only work reliably with HugeTLB memory. "
			"It can work with THP, but that is best effort.\n");
	}

If the test is defaulting to something other than HugeTLB, then we should fix
that in the test.  If the kernel is doing NUMA balancing on HugeTLB VMAs, then
we should fix that in the kernel.
Tao Su Jan. 18, 2024, 5:43 a.m. UTC | #2
On Wed, Jan 17, 2024 at 07:12:08AM -0800, Sean Christopherson wrote:

[...]

> > diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > index 634c6bfcd572..f2c796111d83 100644
> > --- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > +++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > @@ -212,10 +212,21 @@ static void help(char *name)
> >  
> >  int main(int argc, char *argv[])
> >  {
> > +	FILE *f;
> >  	int opt;
> > +	int ret, numa_balancing;
> >  
> >  	TEST_REQUIRE(get_kvm_param_bool("eager_page_split"));
> >  	TEST_REQUIRE(get_kvm_param_bool("tdp_mmu"));
> > +	f = fopen("/proc/sys/kernel/numa_balancing", "r");
> > +	if (f) {
> > +		ret = fscanf(f, "%d", &numa_balancing);
> > +		TEST_ASSERT(ret == 1, "Error reading numa_balancing");
> > +		TEST_ASSERT(!numa_balancing, "please run "
> > +			    "'echo 0 > /proc/sys/kernel/numa_balancing'");
> 
> If we go this route, this should be a TEST_REQUIRE(), not a TEST_ASSERT().  The
> test hasn't failed, rather it has detected an incompatible setup.

Yes, I originally wanted to print a more user-friendly message, but
TEST_REQUIRE() can't customize the output...

> 
> Something isn't right though.  The test defaults to HugeTLB, and the invocation
> in the changelog doesn't override the backing source.  That suggests that NUMA
> auto-balancing is zapping HugeTLB VMAs, which AFAIK shouldn't happen, e.g. this
> code in task_numa_work() should cause such VMAs to be skipped:
> 
> 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> 			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
> 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
> 			continue;
> 		}
> 
> And the test already warns the user if they opt to use something other than
> HugeTLB.
> 
> 	if (!is_backing_src_hugetlb(backing_src)) {
> 		pr_info("This test will only work reliably with HugeTLB memory. "
> 			"It can work with THP, but that is best effort.\n");
> 	}
> 
> If the test is defaulting to something other than HugeTLB, then we should fix
> that in the test.  If the kernel is doing NUMA balancing on HugeTLB VMAs, then
> we should fix that in the kernel.

From my observation, HugeTLB VMAs are not affected by NUMA auto-balancing,
but the backing source for the test code and per-vCPU stacks is not
HugeTLB, e.g. __vm_create() invokes

        vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);

So some of those pages can be migrated.

In dirty_log_page_splitting_test, if I sleep for 3s before enabling dirty logging:

        --- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
        +++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
        @@ -124,6 +124,7 @@ static void run_test(enum vm_guest_mode mode, void *unused)
                memstress_start_vcpu_threads(VCPUS, vcpu_worker);

                run_vcpu_iteration(vm);
        +       sleep(3);
                get_page_stats(vm, &stats_populated, "populating memory");

                /* Enable dirty logging */

I got these logs:

        # ./x86_64/dirty_log_page_splitting_test
        Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
        __vm_create: mode='PA-bits:ANY, VA-bits:48,  4K pages' pages='2710'
        Guest physical address width detected: 52
        guest physical test memory: [0xfffff7fe00000, 0xfffffffe00000)
        Added VCPU 0 with test mem gpa [fffff7fe00000, fffffffe00000)
        Added VCPU 1 with test mem gpa [fffff7fe00000, fffffffe00000)

        Page stats after populating memory: 4K: 0 2M: 1024 1G: 0 huge: 1024

        Page stats after enabling dirty logging: 4K: 524288 2M: 0 1G: 0 huge: 0

        Page stats after dirtying memory: 4K: 525334 2M: 0 1G: 0 huge: 0

        Page stats after dirtying memory: 4K: 525334 2M: 0 1G: 0 huge: 0

        Page stats after disabling dirty logging: 4K: 1046 2M: 0 1G: 0 huge: 0

        Page stats after repopulating memory: 4K: 1046 2M: 1024 1G: 0 huge: 1024
        ==== Test Assertion Failure ====
          x86_64/dirty_log_page_splitting_test.c:196: stats_populated.pages_4k == stats_repopulated.pages_4k
          pid=2660413 tid=2660413 errno=0 - Success
             1  0x0000000000402d4b: run_test at dirty_log_page_splitting_test.c:196 (discriminator 1)
             2  0x0000000000403724: for_each_guest_mode at guest_modes.c:100
             3  0x00000000004024ef: main at dirty_log_page_splitting_test.c:246
             4  0x00007fd72c229d8f: ?? ??:0
             5  0x00007fd72c229e3f: ?? ??:0
             6  0x00000000004025b4: _start at ??:?
          0 != 0x416 (stats_populated.pages_4k != stats_repopulated.pages_4k)

Thanks,
Tao
Sean Christopherson Jan. 18, 2024, 5:33 p.m. UTC | #3
+David

On Thu, Jan 18, 2024, Tao Su wrote:
> On Wed, Jan 17, 2024 at 07:12:08AM -0800, Sean Christopherson wrote:
> 
> [...]
> 
> > > diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > > index 634c6bfcd572..f2c796111d83 100644
> > > --- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > > +++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > > @@ -212,10 +212,21 @@ static void help(char *name)
> > >  
> > >  int main(int argc, char *argv[])
> > >  {
> > > +	FILE *f;
> > >  	int opt;
> > > +	int ret, numa_balancing;
> > >  
> > >  	TEST_REQUIRE(get_kvm_param_bool("eager_page_split"));
> > >  	TEST_REQUIRE(get_kvm_param_bool("tdp_mmu"));
> > > +	f = fopen("/proc/sys/kernel/numa_balancing", "r");
> > > +	if (f) {
> > > +		ret = fscanf(f, "%d", &numa_balancing);
> > > +		TEST_ASSERT(ret == 1, "Error reading numa_balancing");
> > > +		TEST_ASSERT(!numa_balancing, "please run "
> > > +			    "'echo 0 > /proc/sys/kernel/numa_balancing'");
> > 
> > If we go this route, this should be a TEST_REQUIRE(), not a TEST_ASSERT().  The
> > test hasn't failed, rather it has detected an incompatible setup.
> 
> Yes, I originally wanted to print a more user-friendly message, but
> TEST_REQUIRE() can't customize the output...

__TEST_REQUIRE()
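
E.g. (untested):

	__TEST_REQUIRE(!numa_balancing,
		       "NUMA balancing must be disabled, e.g. run "
		       "'echo 0 > /proc/sys/kernel/numa_balancing'");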

> > Something isn't right though.  The test defaults to HugeTLB, and the invocation
> > in the changelog doesn't override the backing source.  That suggests that NUMA
> > auto-balancing is zapping HugeTLB VMAs, which AFAIK shouldn't happen, e.g. this
> > code in task_numa_work() should cause such VMAs to be skipped:
> > 
> > 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> > 			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
> > 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
> > 			continue;
> > 		}
> > 
> > And the test already warns the user if they opt to use something other than
> > HugeTLB.
> > 
> > 	if (!is_backing_src_hugetlb(backing_src)) {
> > 		pr_info("This test will only work reliably with HugeTLB memory. "
> > 			"It can work with THP, but that is best effort.\n");
> > 	}
> > 
> > If the test is defaulting to something other than HugeTLB, then we should fix
> > that in the test.  If the kernel is doing NUMA balancing on HugeTLB VMAs, then
> > we should fix that in the kernel.
> 
> From my observation, HugeTLB VMAs are not affected by NUMA auto-balancing,
> but the backing source for the test code and per-vCPU stacks is not
> HugeTLB, e.g. __vm_create() invokes
> 
>         vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);
> 
> So some of those pages can be migrated.

Ah, hmm.  Requiring NUMA balancing be disabled isn't going to fix the underlying
issue, it's just guarding against one of the more likely culprits.  The best fix
is likely to have the test precisely validate _only_ the test data pages.  E.g.
if we double down on requiring HugeTLB, then the test should be able to assert
that it has at least N hugepages when dirty logging is disabled, and at least M
4KiB pages when dirty logging is enabled.
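
E.g. (completely untested; the nr_test_* bounds are placeholders that
would need to be computed from the size of the test data region, and the
stats_* variable names are illustrative):

	/* All test data should be backed by hugepages before dirty logging. */
	TEST_ASSERT(stats_populated.hugepages >= nr_test_hugepages,
		    "Expected at least %lu hugepages, found %lu",
		    nr_test_hugepages, stats_populated.hugepages);

	/* Eager splitting should shatter all test data into 4KiB mappings. */
	TEST_ASSERT(stats_dirty_logging_enabled.pages_4k >= nr_test_4k_pages,
		    "Expected at least %lu 4KiB pages, found %lu",
		    nr_test_4k_pages, stats_dirty_logging_enabled.pages_4k);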
Tao Su Jan. 19, 2024, 6:05 a.m. UTC | #4
On Thu, Jan 18, 2024 at 09:33:29AM -0800, Sean Christopherson wrote:
> +David
> 
> On Thu, Jan 18, 2024, Tao Su wrote:
> > On Wed, Jan 17, 2024 at 07:12:08AM -0800, Sean Christopherson wrote:
> > 
> > [...]
> > 
> > > > diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > > > index 634c6bfcd572..f2c796111d83 100644
> > > > --- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > > > +++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
> > > > @@ -212,10 +212,21 @@ static void help(char *name)
> > > >  
> > > >  int main(int argc, char *argv[])
> > > >  {
> > > > +	FILE *f;
> > > >  	int opt;
> > > > +	int ret, numa_balancing;
> > > >  
> > > >  	TEST_REQUIRE(get_kvm_param_bool("eager_page_split"));
> > > >  	TEST_REQUIRE(get_kvm_param_bool("tdp_mmu"));
> > > > +	f = fopen("/proc/sys/kernel/numa_balancing", "r");
> > > > +	if (f) {
> > > > +		ret = fscanf(f, "%d", &numa_balancing);
> > > > +		TEST_ASSERT(ret == 1, "Error reading numa_balancing");
> > > > +		TEST_ASSERT(!numa_balancing, "please run "
> > > > +			    "'echo 0 > /proc/sys/kernel/numa_balancing'");
> > > 
> > > If we go this route, this should be a TEST_REQUIRE(), not a TEST_ASSERT().  The
> > > test hasn't failed, rather it has detected an incompatible setup.
> > 
> > Yes, I originally wanted to print a more user-friendly message, but
> > TEST_REQUIRE() can't customize the output...
> 
> __TEST_REQUIRE()

Got it.

> 
> > > Something isn't right though.  The test defaults to HugeTLB, and the invocation
> > > in the changelog doesn't override the backing source.  That suggests that NUMA
> > > auto-balancing is zapping HugeTLB VMAs, which AFAIK shouldn't happen, e.g. this
> > > code in task_numa_work() should cause such VMAs to be skipped:
> > > 
> > > 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
> > > 			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
> > > 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
> > > 			continue;
> > > 		}
> > > 
> > > And the test already warns the user if they opt to use something other than
> > > HugeTLB.
> > > 
> > > 	if (!is_backing_src_hugetlb(backing_src)) {
> > > 		pr_info("This test will only work reliably with HugeTLB memory. "
> > > 			"It can work with THP, but that is best effort.\n");
> > > 	}
> > > 
> > > If the test is defaulting to something other than HugeTLB, then we should fix
> > > that in the test.  If the kernel is doing NUMA balancing on HugeTLB VMAs, then
> > > we should fix that in the kernel.
> > 
> > From my observation, HugeTLB VMAs are not affected by NUMA auto-balancing,
> > but the backing source for the test code and per-vCPU stacks is not
> > HugeTLB, e.g. __vm_create() invokes
> > 
> >         vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0);
> > 
> > So some of those pages can be migrated.
> 
> Ah, hmm.  Requiring NUMA balancing be disabled isn't going to fix the underlying
> issue, it's just guarding against one of the more likely culprits.  The best fix
> is likely to have the test precisely validate _only_ the test data pages.  E.g.
> if we double down on requiring HugeTLB, then the test should be able to assert
> that it has at least N hugepages when dirty logging is disabled, and at least M
> 4KiB pages when dirty logging is enabled.

I see, I will update the ASSERT conditions, thanks for your guidance.

Thanks,
Tao

Patch

diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
index 634c6bfcd572..f2c796111d83 100644
--- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
+++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
@@ -212,10 +212,21 @@  static void help(char *name)
 
 int main(int argc, char *argv[])
 {
+	FILE *f;
 	int opt;
+	int ret, numa_balancing;
 
 	TEST_REQUIRE(get_kvm_param_bool("eager_page_split"));
 	TEST_REQUIRE(get_kvm_param_bool("tdp_mmu"));
+	f = fopen("/proc/sys/kernel/numa_balancing", "r");
+	if (f) {
+		ret = fscanf(f, "%d", &numa_balancing);
+		TEST_ASSERT(ret == 1, "Error reading numa_balancing");
+		TEST_ASSERT(!numa_balancing, "please run "
+			    "'echo 0 > /proc/sys/kernel/numa_balancing'");
+		fclose(f);
+		f = NULL;
+	}
 
 	while ((opt = getopt(argc, argv, "b:hs:")) != -1) {
 		switch (opt) {