diff mbox series

[v3] KVM: selftests: Fix the dirty_log_test semaphore imbalance

Message ID 20240202064332.9403-1-shahuang@redhat.com (mailing list archive)
State New
Headers show
Series [v3] KVM: selftests: Fix the dirty_log_test semaphore imbalance | expand

Commit Message

Shaoqin Huang Feb. 2, 2024, 6:43 a.m. UTC
When execute the dirty_log_test on some aarch64 machine, it sometimes
trigger the ASSERT:

==== Test Assertion Failure ====
  dirty_log_test.c:384: dirty_ring_vcpu_ring_full
  pid=14854 tid=14854 errno=22 - Invalid argument
     1  0x00000000004033eb: dirty_ring_collect_dirty_pages at dirty_log_test.c:384
     2  0x0000000000402d27: log_mode_collect_dirty_pages at dirty_log_test.c:505
     3   (inlined by) run_test at dirty_log_test.c:802
     4  0x0000000000403dc7: for_each_guest_mode at guest_modes.c:100
     5  0x0000000000401dff: main at dirty_log_test.c:941 (discriminator 3)
     6  0x0000ffff9be173c7: ?? ??:0
     7  0x0000ffff9be1749f: ?? ??:0
     8  0x000000000040206f: _start at ??:?
  Didn't continue vcpu even without ring full

The dirty_log_test fails when execute the dirty-ring test, this is
because the sem_vcpu_cont and the sem_vcpu_stop is non-zero value when
execute the dirty_ring_collect_dirty_pages() function. When those two
sem_t variables are non-zero, the dirty_ring_wait_vcpu() at the
beginning of the dirty_ring_collect_dirty_pages() will not wait for the
vcpu to stop, but continue to execute the following code. In this case,
before vcpu stop, if the dirty_ring_vcpu_ring_full is true, and the
dirty_ring_collect_dirty_pages() has passed the check for the
dirty_ring_vcpu_ring_full but hasn't execute the check for the
continued_vcpu, the vcpu stop, and set the dirty_ring_vcpu_ring_full to
false. Then dirty_ring_collect_dirty_pages() will trigger the ASSERT.

Why sem_vcpu_cont and sem_vcpu_stop can be non-zero value? It's because
the dirty_ring_before_vcpu_join() execute the sem_post(&sem_vcpu_cont)
at the end of each dirty-ring test. It can cause two cases:

1. sem_vcpu_cont be non-zero. When we set the host_quit to be true,
   the vcpu_worker directly see the host_quit to be true, it quit. So
   the log_mode_before_vcpu_join() function will set the sem_vcpu_cont
   to 1, since the vcpu_worker has quit, it won't consume it.
2. sem_vcpu_stop be non-zero. When we set the host_quit to be true,
   the vcpu_worker has entered the guest state, the next time it exit
   from guest state, it will set the sem_vcpu_stop to 1, and then see
   the host_quit, no one will consume the sem_vcpu_stop.

When execute more and more dirty-ring tests, the sem_vcpu_cont and
sem_vcpu_stop can be larger and larger, which makes many code paths
don't wait for the sem_t. Thus finally cause the problem.

To fix this problem, we can wait a while before set the host_quit to
true, which gives the vcpu time to enter the guest state, so it will
exit again. Then we can wait the vcpu to exit, and let it continue
again, then the vcpu will see the host_quit. Thus the sem_vcpu_cont and
sem_vcpu_stop will be both zero when test finished.

Signed-off-by: Shaoqin Huang <shahuang@redhat.com>
---
v2->v3:
  - Rebase to v6.8-rc2.
  - Use TEST_ASSERT().

v1->v2:
  - Fix the real logic bug, not just fresh the context.

v1: https://lore.kernel.org/all/20231116093536.22256-1-shahuang@redhat.com/
v2: https://lore.kernel.org/all/20231117052210.26396-1-shahuang@redhat.com/

 tools/testing/selftests/kvm/dirty_log_test.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)


base-commit: 41bccc98fb7931d63d03f326a746ac4d429c1dd3

Comments

Sean Christopherson Feb. 2, 2024, 5:46 p.m. UTC | #1
On Fri, Feb 02, 2024, Shaoqin Huang wrote:
> ---
> v2->v3:
>   - Rebase to v6.8-rc2.
>   - Use TEST_ASSERT().

Patch says otherwise.

> @@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  		return;
>  	}
>  
> +	sem_getvalue(&sem_vcpu_stop, &sem_val);
> +	assert(sem_val == 0);
> +	sem_getvalue(&sem_vcpu_cont, &sem_val);
> +	assert(sem_val == 0);
> +
>  	/*
>  	 * We reserve page table for 2 times of extra dirty mem which
>  	 * will definitely cover the original (1G+) test range.  Here
> @@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  		sync_global_to_guest(vm, iteration);
>  	}
>  
> +	/*
> +	 *
> +	 * Before we set the host_quit, let the vcpu has time to run, to make
> +	 * sure we consume the sem_vcpu_stop and the vcpu consume the
> +	 * sem_vcpu_cont, to keep the semaphore balance.
> +	 */
> +	usleep(p->interval * 1000);

Sorry, I wasn't as explicit as I should have been.  When I said I don't have a
better solution, I did not mean to imply that I am ok with busy waiting as a
hack-a-around.

Against my better judgment, I spent half an hour slogging through this test to
figure out what's going wrong.  IIUC, the problem is essentially that the test
instructs the vCPU worker to continue _after_ the last iteration, and _then_ sets
host_quit, which results in the vCPU running one extra (unvalidated) iteration.

For the other modes, which stop if and only if vcpu_sync_stop_requested is set,
the extra iteration is a non-issue.  But because the dirty ring variant stops
after every exit (to purge the ring), it hangs without an extra "continue".

So rather than blindly fire off an extra sem_vcpu_cont that may or may not be
consumed, I believe the test can simply set host_quit _before_ the final "continue"
so that the vCPU worker doesn't run an extra iteration.

I ran the below with 1000 loops of "for (i = 0; i < LOG_MODE_NUM; i++)" and so
no issues.  I didn't see the assert you hit, but without the fix, I did see this
fire within a few loops (less than 5 I think)l

	assert(host_log_mode == LOG_MODE_DIRTY_RING ||
	       atomic_read(&vcpu_sync_stop_requested) == false);

I'll post this as two patches: one to fix the bug, and a second to have the
LOG_MODE_DIRTY_RING variant clear vcpu_sync_stop_requested so that the above
assert() can be modified as below.

---
 tools/testing/selftests/kvm/dirty_log_test.c | 62 ++++++++++++--------
 1 file changed, 37 insertions(+), 25 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index babea97b31a4..1a93c4231e45 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -376,7 +376,10 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
 
 	cleared = kvm_vm_reset_dirty_ring(vcpu->vm);
 
-	/* Cleared pages should be the same as collected */
+	/*
+	 * Cleared pages should be the same as collected, as KVM is supposed to
+	 * clear only the entries that have been harvested.
+	 */
 	TEST_ASSERT(cleared == count, "Reset dirty pages (%u) mismatch "
 		    "with collected (%u)", cleared, count);
 
@@ -402,6 +405,15 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
 		/* Update the flag first before pause */
 		WRITE_ONCE(dirty_ring_vcpu_ring_full,
 			   run->exit_reason == KVM_EXIT_DIRTY_RING_FULL);
+
+		/*
+		 * Unconditionally clear the stop request.  This dirty ring
+		 * variant doesn't actually consume the request, because the
+		 * vCPU worker stops after every exit to allow the main task
+		 * to reap the dirty ring.  But the vCPU is still obviously
+		 * stopped, i.e. has honored the request if one was made.
+		 */
+		atomic_set(&vcpu_sync_stop_requested, false);
 		sem_post(&sem_vcpu_stop);
 		pr_info("vcpu stops because %s...\n",
 			dirty_ring_vcpu_ring_full ?
@@ -415,12 +427,6 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
 	}
 }
 
-static void dirty_ring_before_vcpu_join(void)
-{
-	/* Kick another round of vcpu just to make sure it will quit */
-	sem_post(&sem_vcpu_cont);
-}
-
 struct log_mode {
 	const char *name;
 	/* Return true if this mode is supported, otherwise false */
@@ -433,7 +439,6 @@ struct log_mode {
 				     uint32_t *ring_buf_idx);
 	/* Hook to call when after each vcpu run */
 	void (*after_vcpu_run)(struct kvm_vcpu *vcpu, int ret, int err);
-	void (*before_vcpu_join) (void);
 } log_modes[LOG_MODE_NUM] = {
 	{
 		.name = "dirty-log",
@@ -452,7 +457,6 @@ struct log_mode {
 		.supported = dirty_ring_supported,
 		.create_vm_done = dirty_ring_create_vm_done,
 		.collect_dirty_pages = dirty_ring_collect_dirty_pages,
-		.before_vcpu_join = dirty_ring_before_vcpu_join,
 		.after_vcpu_run = dirty_ring_after_vcpu_run,
 	},
 };
@@ -513,14 +517,6 @@ static void log_mode_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
 		mode->after_vcpu_run(vcpu, ret, err);
 }
 
-static void log_mode_before_vcpu_join(void)
-{
-	struct log_mode *mode = &log_modes[host_log_mode];
-
-	if (mode->before_vcpu_join)
-		mode->before_vcpu_join();
-}
-
 static void generate_random_array(uint64_t *guest_array, uint64_t size)
 {
 	uint64_t i;
@@ -719,6 +715,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	unsigned long *bmap;
 	uint32_t ring_buf_idx = 0;
+	int sem_val;
 
 	if (!log_mode_supported()) {
 		print_skip("Log mode '%s' not supported",
@@ -788,12 +785,22 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	/* Start the iterations */
 	iteration = 1;
 	sync_global_to_guest(vm, iteration);
-	host_quit = false;
+	WRITE_ONCE(host_quit, false);
 	host_dirty_count = 0;
 	host_clear_count = 0;
 	host_track_next_count = 0;
 	WRITE_ONCE(dirty_ring_vcpu_ring_full, false);
 
+	/*
+	 * Ensure the previous iteration didn't leave a dangling semaphore, i.e.
+	 * that the main task and vCPU worker were synchronized and completed
+	 * verification of all iterations.
+	 */
+	sem_getvalue(&sem_vcpu_stop, &sem_val);
+	TEST_ASSERT_EQ(sem_val, 0);
+	sem_getvalue(&sem_vcpu_cont, &sem_val);
+	TEST_ASSERT_EQ(sem_val, 0);
+
 	pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu);
 
 	while (iteration < p->iterations) {
@@ -816,18 +823,23 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		 * the flush of the last page, and since we handle the last
 		 * page specially verification will succeed anyway.
 		 */
-		assert(host_log_mode == LOG_MODE_DIRTY_RING ||
-		       atomic_read(&vcpu_sync_stop_requested) == false);
+		TEST_ASSERT_EQ(atomic_read(&vcpu_sync_stop_requested), false);
 		vm_dirty_log_verify(mode, bmap);
+
+		/*
+		 * Set host_quit before sem_vcpu_cont in the final iteration to
+		 * ensure that the vCPU worker doesn't resume the guest.  As
+		 * above, the dirty ring test may stop and wait even when not
+		 * explicitly request to do so, i.e. would hang waiting for a
+		 * "continue" if it's allowed to resume the guest.
+		 */
+		if (++iteration == p->iterations)
+			WRITE_ONCE(host_quit, true);
+
 		sem_post(&sem_vcpu_cont);
-
-		iteration++;
 		sync_global_to_guest(vm, iteration);
 	}
 
-	/* Tell the vcpu thread to quit */
-	host_quit = true;
-	log_mode_before_vcpu_join();
 	pthread_join(vcpu_thread, NULL);
 
 	pr_info("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "

base-commit: 78651ed78b3496b6dbf1fb7d64d9d9736a23edc0
--
Sean Christopherson Feb. 2, 2024, 9:14 p.m. UTC | #2
On Fri, Feb 02, 2024, Sean Christopherson wrote:
> On Fri, Feb 02, 2024, Shaoqin Huang wrote:
> > ---
> > v2->v3:
> >   - Rebase to v6.8-rc2.
> >   - Use TEST_ASSERT().
> 
> Patch says otherwise.
> 
> > @@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >  		return;
> >  	}
> >  
> > +	sem_getvalue(&sem_vcpu_stop, &sem_val);
> > +	assert(sem_val == 0);
> > +	sem_getvalue(&sem_vcpu_cont, &sem_val);
> > +	assert(sem_val == 0);
> > +
> >  	/*
> >  	 * We reserve page table for 2 times of extra dirty mem which
> >  	 * will definitely cover the original (1G+) test range.  Here
> > @@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >  		sync_global_to_guest(vm, iteration);
> >  	}
> >  
> > +	/*
> > +	 *
> > +	 * Before we set the host_quit, let the vcpu has time to run, to make
> > +	 * sure we consume the sem_vcpu_stop and the vcpu consume the
> > +	 * sem_vcpu_cont, to keep the semaphore balance.
> > +	 */
> > +	usleep(p->interval * 1000);
> 
> Sorry, I wasn't as explicit as I should have been.  When I said I don't have a
> better solution, I did not mean to imply that I am ok with busy waiting as a
> hack-a-around.
> 
> Against my better judgment, I spent half an hour slogging through this test to
> figure out what's going wrong.  IIUC, the problem is essentially that the test
> instructs the vCPU worker to continue _after_ the last iteration, and _then_ sets
> host_quit, which results in the vCPU running one extra (unvalidated) iteration.
> 
> For the other modes, which stop if and only if vcpu_sync_stop_requested is set,
> the extra iteration is a non-issue.  But because the dirty ring variant stops
> after every exit (to purge the ring), it hangs without an extra "continue".
> 
> So rather than blindly fire off an extra sem_vcpu_cont that may or may not be
> consumed, I believe the test can simply set host_quit _before_ the final "continue"
> so that the vCPU worker doesn't run an extra iteration.
> 
> I ran the below with 1000 loops of "for (i = 0; i < LOG_MODE_NUM; i++)" and so
> no issues.  I didn't see the assert you hit, but without the fix, I did see this
> fire within a few loops (less than 5 I think)l
> 
> 	assert(host_log_mode == LOG_MODE_DIRTY_RING ||
> 	       atomic_read(&vcpu_sync_stop_requested) == false);
> 
> I'll post this as two patches: one to fix the bug, and a second to have the
> LOG_MODE_DIRTY_RING variant clear vcpu_sync_stop_requested so that the above
> assert() can be modified as below.

Scratch patch 2, I'm pretty sure the vCPU worker can race with the main task and
clear vcpu_sync_stop_requested just before the main task sets it, which would
result in a false positive.  I didn't see any failures, but I'm 99% certain the
race exists.  I suspect there are some other warts in the test that would
complicate attempts to clean things up, so for now I'l just post the fix for
the imbalance bug.
Peter Xu Feb. 5, 2024, 10:05 a.m. UTC | #3
Shaoqin, Sean,

Apologies for a late comment.  I'm trying to remember what I wrote..

On Fri, Feb 02, 2024 at 01:43:32AM -0500, Shaoqin Huang wrote:
> Why sem_vcpu_cont and sem_vcpu_stop can be non-zero value? It's because
> the dirty_ring_before_vcpu_join() execute the sem_post(&sem_vcpu_cont)
> at the end of each dirty-ring test. It can cause two cases:

As a possible alternative, would it work if we simply reset all the sems
for each run?  Then we don't care about the leftovers.  E.g. sem_destroy()
at the end of run_test(), then always init to 0 at entry.
Peter Xu Feb. 5, 2024, 10:18 a.m. UTC | #4
On Mon, Feb 05, 2024 at 06:05:02PM +0800, Peter Xu wrote:
> Shaoqin, Sean,
> 
> Apologies for a late comment.  I'm trying to remember what I wrote..
> 
> On Fri, Feb 02, 2024 at 01:43:32AM -0500, Shaoqin Huang wrote:
> > Why sem_vcpu_cont and sem_vcpu_stop can be non-zero value? It's because
> > the dirty_ring_before_vcpu_join() execute the sem_post(&sem_vcpu_cont)
> > at the end of each dirty-ring test. It can cause two cases:
> 
> As a possible alternative, would it work if we simply reset all the sems
> for each run?  Then we don't care about the leftovers.  E.g. sem_destroy()
> at the end of run_test(), then always init to 0 at entry.

One more thing when I was reading the code again: I had a feeling that I
missed one call to vcpu_handle_sync_stop() for the dirty ring case:

======
@@ -395,8 +395,7 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
 
        /* A ucall-sync or ring-full event is allowed */
        if (get_ucall(vcpu, NULL) == UCALL_SYNC) {
-               /* We should allow this to continue */
-               ;
+               vcpu_handle_sync_stop();
        } else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
                   (ret == -1 && err == EINTR)) {
                /* Update the flag first before pause */
======

Otherwise it'll be meaningless for run_test() to set
vcpu_sync_stop_requested for the ring test, if the ring test never reads
it..

Without about change, the test will still work (and I assume that's why
nobody noticed including myself..), but IIUC the vcpu can stop later,
e.g. until the ring fulls, or there's some leftover SIGUSR1 around.

With this change, the vcpu can stop earlier, as soon as the main thread
requested a stop, which should be what the code wanted to do.

Shaoqin, feel free to have a look there too if you're working on the test.

Thanks,
Sean Christopherson Feb. 5, 2024, 3:48 p.m. UTC | #5
On Mon, Feb 05, 2024, Peter Xu wrote:
> Shaoqin, Sean,
> 
> Apologies for a late comment.  I'm trying to remember what I wrote..
> 
> On Fri, Feb 02, 2024 at 01:43:32AM -0500, Shaoqin Huang wrote:
> > Why sem_vcpu_cont and sem_vcpu_stop can be non-zero value? It's because
> > the dirty_ring_before_vcpu_join() execute the sem_post(&sem_vcpu_cont)
> > at the end of each dirty-ring test. It can cause two cases:
> 
> As a possible alternative, would it work if we simply reset all the sems
> for each run?  Then we don't care about the leftovers.  E.g. sem_destroy()
> at the end of run_test(), then always init to 0 at entry.

Gah, I posted v4[*] and didn't Cc you.  I would prefer not to reset the semaphores,
mostly because it encourages sloppiness in the test.  There's no reason to allow
the two threads to effectively get out of sync.

[*] https://lore.kernel.org/all/17eefa60-cf30-4830-943e-793a63d4e6f1@redhat.com
diff mbox series

Patch

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 6cbecf499767..dd2d8be390a5 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -417,7 +417,8 @@  static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
 
 static void dirty_ring_before_vcpu_join(void)
 {
-	/* Kick another round of vcpu just to make sure it will quit */
+	/* Wait vcpu exit, and let it continue to see the host_quit. */
+	dirty_ring_wait_vcpu();
 	sem_post(&sem_vcpu_cont);
 }
 
@@ -719,6 +720,7 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	unsigned long *bmap;
 	uint32_t ring_buf_idx = 0;
+	int sem_val;
 
 	if (!log_mode_supported()) {
 		print_skip("Log mode '%s' not supported",
@@ -726,6 +728,11 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 		return;
 	}
 
+	sem_getvalue(&sem_vcpu_stop, &sem_val);
+	assert(sem_val == 0);
+	sem_getvalue(&sem_vcpu_cont, &sem_val);
+	assert(sem_val == 0);
+
 	/*
 	 * We reserve page table for 2 times of extra dirty mem which
 	 * will definitely cover the original (1G+) test range.  Here
@@ -825,6 +832,13 @@  static void run_test(enum vm_guest_mode mode, void *arg)
 		sync_global_to_guest(vm, iteration);
 	}
 
+	/*
+	 *
+	 * Before we set the host_quit, let the vcpu has time to run, to make
+	 * sure we consume the sem_vcpu_stop and the vcpu consume the
+	 * sem_vcpu_cont, to keep the semaphore balance.
+	 */
+	usleep(p->interval * 1000);
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
 	log_mode_before_vcpu_join();