Message ID | 20210413213641.23742-1-peterx@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v2] kvm/selftests: Fix race condition with dirty_log_test
On Tue, Apr 13, 2021 at 05:36:41PM -0400, Peter Xu wrote:
> This fixes a bug that can trigger with e.g. "taskset -c 0 ./dirty_log_test" or
> when the testing host is very busy.
>
> The issue is when the vcpu thread got the dirty bit set but got preempted by
> other threads _before_ the data is written, we won't be able to see the latest
> data only until the vcpu threads do VMENTER. IOW, the guest write operation and
> dirty bit set cannot guarantee atomicity. The race could look like:
>
>   main thread                         vcpu thread
>   ===========                         ===========
>   iteration=X
>                                       *addr = X
>                                       (so latest data is X)
>   iteration=X+1
>   ...
>   iteration=X+N
>                                       guest executes "*addr = X+N"
>                                       reg=READ_ONCE(iteration)=X+N
>                                       host page fault
>                                       set dirty bit for page "addr"
>                                       (_before_ VMENTER happens...
>                                        so *addr is still X!)
>                                       vcpu thread got preempted
>   get dirty log
>   verify data
>   detected dirty bit set, data is X
>   not X+N nor X+N-1, data too old!
>
> This patch closes this race by allowing the main thread to give the vcpu thread
> chance to do a VMENTER to complete that write operation. It's done by adding a
> vcpu loop counter (must be defined as volatile as main thread will do read
> loop), then the main thread can guarantee the vcpu got at least another VMENTER
> by making sure the guest_vcpu_loops increases by 2.
>
> Dirty ring does not need this since dirty_ring_last_page would already help
> avoid this specific race condition.
>
> Cc: Andrew Jones <drjones@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v2:
> - drop one unnecessary check on "!matched"
> ---
>  tools/testing/selftests/kvm/dirty_log_test.c | 53 +++++++++++++++++++-
>  1 file changed, 52 insertions(+), 1 deletion(-)

Reviewed-by: Andrew Jones <drjones@redhat.com>
On 13/04/21 23:36, Peter Xu wrote:
> This patch closes this race by allowing the main thread to give the vcpu thread
> chance to do a VMENTER to complete that write operation. It's done by adding a
> vcpu loop counter (must be defined as volatile as main thread will do read
> loop), then the main thread can guarantee the vcpu got at least another VMENTER
> by making sure the guest_vcpu_loops increases by 2.
>
> Dirty ring does not need this since dirty_ring_last_page would already help
> avoid this specific race condition.

Just a nit, the comment and commit message should mention KVM_RUN rather than
vmentry; it's possible to be preempted many times in vcpu_enter_guest without
making progress, but those wouldn't return to userspace and thus would not
update guest_vcpu_loops.

Also, volatile is considered harmful even in userspace/test code[1].
Technically rather than volatile one should use an atomic load (even a relaxed
one), but in practice it's okay to use volatile too *for this specific use*
(READ_ONCE/WRITE_ONCE are volatile reads and writes as well).  If the selftests
gained 32-bit support, one should not use volatile because neither reads nor
writes to uint64_t variables would be guaranteed to be atomic.

Queued, thanks.

Paolo

[1] Documentation/process/volatile-considered-harmful.rst
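For illustration only, a minimal sketch of what the counter could look like
with C11 relaxed atomics instead of volatile, as suggested above.  The variable
name guest_vcpu_loops comes from the patch; the helpers vcpu_loop_inc() and
vcpu_loop_read() are hypothetical and not part of any posted version:

#include <stdatomic.h>
#include <stdint.h>

/* Counter shared between the vcpu thread (writer) and the main thread (reader). */
static _Atomic uint64_t guest_vcpu_loops;

/* vcpu thread: bump the counter once per completed KVM_RUN. */
static inline void vcpu_loop_inc(void)
{
	atomic_fetch_add_explicit(&guest_vcpu_loops, 1, memory_order_relaxed);
}

/* main thread: tear-free read of the counter; relaxed ordering is enough here. */
static inline uint64_t vcpu_loop_read(void)
{
	return atomic_load_explicit(&guest_vcpu_loops, memory_order_relaxed);
}

Unlike volatile, the atomic accesses stay tear-free even if the counter is
wider than the machine word, which is the 32-bit concern raised above.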
Paolo,

Please hold on with this patch, since I got another report that it can still
trigger a test failure under an even heavier workload (e.g., "taskset -c 0
./dirty_log_test" plus one or multiple "taskset -c 0 while :; do :; done").
So I plan to do it in another way.  I tested that longer yesterday but haven't
updated this patch yet.  More below.

On Sat, Apr 17, 2021 at 02:59:48PM +0200, Paolo Bonzini wrote:
> On 13/04/21 23:36, Peter Xu wrote:
> > This patch closes this race by allowing the main thread to give the vcpu thread
> > chance to do a VMENTER to complete that write operation. It's done by adding a
> > vcpu loop counter (must be defined as volatile as main thread will do read
> > loop), then the main thread can guarantee the vcpu got at least another VMENTER
> > by making sure the guest_vcpu_loops increases by 2.
> >
> > Dirty ring does not need this since dirty_ring_last_page would already help
> > avoid this specific race condition.
>
> Just a nit, the comment and commit message should mention KVM_RUN rather
> than vmentry; it's possible to be preempted many times in vcpu_enter_guest
> without making progress, but those wouldn't return to userspace and thus
> would not update guest_vcpu_loops.

But what I really wanted to emphasize is the vmentry point rather than KVM_RUN:
KVM_RUN can return without a vmentry, while the vmentry is exactly the point
where the data will be flushed.

> Also, volatile is considered harmful even in userspace/test code[1].
> Technically rather than volatile one should use an atomic load (even a
> relaxed one), but in practice it's okay to use volatile too *for this
> specific use* (READ_ONCE/WRITE_ONCE are volatile reads and writes as well).
> If the selftests gained 32-bit support, one should not use volatile because
> neither reads nor writes to uint64_t variables would be guaranteed to be
> atomic.

Indeed!  I'll start to use atomics.

Regarding why this patch won't really solve all race conditions...

The problem is, I think, that one guest memory write operation (of this
specific test) contains a few micro-steps when the page is under kvm dirty
tracking (here I'm only considering write-protect rather than pml, but pml
should be similar, at least when the log buffer is full):

  (1) Guest reads the 'iteration' number into a register, prepares to write,
      page fault
  (2) Set dirty bit in either the dirty bitmap or the dirty ring
  (3) Return to guest, data written

When we verify the data, we assumed that all these steps are "atomic", say,
when (1) happened for this page, we assume (2) & (3) must have happened.  We
had some tricks to work around the "un-atomicity" of the above three steps,
and this patch wanted to fix the atomicity of steps (2)+(3) by explicitly
letting the main thread wait for at least one vmenter of the vcpu thread,
which should work.  However, what I overlooked is that we still have a race
when (1) and (2) can be interrupted.
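For reference, a simplified sketch of the guest-side write loop with the three
micro-steps above marked as comments.  This only approximates guest_code() in
dirty_log_test.c (the real code also randomizes and aligns the target page) and
relies on the test's existing globals and macros (guest_test_virt_mem,
guest_num_pages, guest_page_size, iteration, READ_ONCE, GUEST_SYNC), so details
may differ from the actual file:

/* Simplified, approximate rendering of dirty_log_test.c's guest_code(). */
static void guest_code(void)
{
	uint64_t addr;
	int i;

	while (true) {
		for (i = 0; i < TEST_PAGES_PER_LOOP; i++) {
			addr = guest_test_virt_mem +
			       (i % guest_num_pages) * guest_page_size;

			/*
			 * Step (1): "iteration" is read into a register; the
			 * store below faults because the page is
			 * write-protected for dirty logging.
			 * Step (2): the host fault handler sets the dirty bit
			 * (bitmap or ring).
			 * Step (3): only after the next vmenter does the store
			 * retire and the new value land in memory.
			 */
			*(uint64_t *)addr = READ_ONCE(iteration);
		}

		/* Tell the host one round of writes is done. */
		GUEST_SYNC(1);
	}
}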
As an example of how steps (1) and (2) can get interrupted, I simply tried to
trace kvm_vcpu_mark_page_dirty() and dump the stack for vmexit cases; we can
see at least a bunch of cases where the vcpu can be scheduled out even before
setting the dirty bit:

@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    kmem_cache_alloc+583
    kvm_mmu_topup_memory_cache+33
    direct_page_fault+237
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    down_read+14
    get_user_pages_unlocked+90
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 23
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __alloc_pages+663
    alloc_pages_vma+128
    wp_page_copy+773
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 1406
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    hva_to_pfn+157
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 2579
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    mmu_notifier_invalidate_range_start+9
    wp_page_copy+296
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 3309
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __get_user_pages+530
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4499

It means it can always happen that the vcpu reads a very old "iteration" value
in step (1) and doesn't set the dirty bit (step (2)) or write it to memory
(step (3)) until, say, one year later.. :)  Then the verification won't pass,
since the main thread's iteration is already much newer, and the main thread
shouts at us.

So far I don't see an easy way to guarantee atomicity of all steps 1-3 (as
this patch only achieved it for steps 2-3) other than syncing at the
GUEST_SYNC() point of the guest code when we do verification of the dirty
bits.  Drew mentioned something like this previously in the bugzilla; I wanted
to give it a shot with a lighter sync, but it seems not to work.

Paolo, Drew, Sean - feel free to shoot if any of you have a better idea.

As I mentioned, I tested v2 of this patch and so far no issue found.  I'll
post it later today, so maybe we can continue the discussion there too (btw, I
also found another signal race there, so I'll post a series with 2 patches).

Thanks,
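For illustration only, one possible shape of the GUEST_SYNC()-based
synchronization mentioned above: the main thread parks the vcpu thread at a
sync point (after which all pending guest writes have reached memory), verifies
the dirty data, then releases it.  This is a sketch under assumptions, not the
series Peter eventually posted; the semaphore and flag names are hypothetical:

#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names: a stop/continue handshake between the two threads. */
static sem_t sem_vcpu_stop;		/* posted by vcpu thread when parked   */
static sem_t sem_vcpu_cont;		/* posted by main thread to release it */
static atomic_bool vcpu_sync_stop_requested;

static void vcpu_sync_init(void)
{
	sem_init(&sem_vcpu_stop, 0, 0);
	sem_init(&sem_vcpu_cont, 0, 0);
}

/* vcpu thread: call right after the guest exits with GUEST_SYNC(). */
static void vcpu_handle_sync_stop(void)
{
	if (atomic_exchange(&vcpu_sync_stop_requested, false)) {
		sem_post(&sem_vcpu_stop);	/* parked; guest writes have landed */
		sem_wait(&sem_vcpu_cont);	/* wait for verification to finish  */
	}
}

/* main thread: park the vcpu at a sync point, verify dirty data, resume. */
static void with_vcpu_stopped(void (*verify)(void))
{
	atomic_store(&vcpu_sync_stop_requested, true);
	sem_wait(&sem_vcpu_stop);
	verify();
	sem_post(&sem_vcpu_cont);
}

Because the vcpu thread is parked only at GUEST_SYNC(), steps (1)-(3) of a
write can never be split across a verification pass, which is the property the
counter-based approach could not guarantee.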
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index bb2752d78fe3..b639f088bb86 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -160,6 +160,10 @@ static bool dirty_ring_vcpu_ring_full;
  * verifying process, we let it pass.
  */
 static uint64_t dirty_ring_last_page;
+/*
+ * Record how many loops the guest vcpu thread has went through
+ */
+static volatile uint64_t guest_vcpu_loops;
 
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -526,6 +530,7 @@ static void *vcpu_worker(void *data)
 			assert(sig == SIG_IPI);
 		}
 		log_mode_after_vcpu_run(vm, ret, errno);
+		guest_vcpu_loops++;
 	}
 
 	pr_info("Dirtied %"PRIu64" pages\n", pages_count);
@@ -553,6 +558,7 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 		}
 
 		if (test_and_clear_bit_le(page, bmap)) {
+			uint64_t current_loop;
 			bool matched;
 
 			host_dirty_count++;
@@ -565,7 +571,12 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 			matched = (*value_ptr == iteration ||
 				   *value_ptr == iteration - 1);
 
-			if (host_log_mode == LOG_MODE_DIRTY_RING && !matched) {
+			if (matched)
+				continue;
+
+			/* Didn't match.. Let's figure out what's wrong.. */
+			switch (host_log_mode) {
+			case LOG_MODE_DIRTY_RING:
 				if (*value_ptr == iteration - 2 && min_iter <= iteration - 2) {
 					/*
 					 * Short answer: this case is special
@@ -608,6 +619,46 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 					 */
 					continue;
 				}
+				break;
+			case LOG_MODE_DIRTY_LOG:
+			case LOG_MODE_CLEAR_LOG:
+				/*
+				 * This fixes a bug that can trigger with
+				 * e.g. "taskset -c 0 ./dirty_log_test" or when
+				 * the testing host is very busy, when the vcpu
+				 * thread got the dirty bit set but got
+				 * preempted by other threads _before_ the data
+				 * is written, so we won't be able to see the
+				 * latest data only until the vcpu threads do
+				 * VMENTER and the write finally lands to the
+				 * memory.  So when !matched happened, we give
+				 * the vcpu thread _one_ more chance to do a
+				 * VMENTER so as to flush the written data.  We
+				 * do that by observing guest_vcpu_loops to
+				 * increase +2: as +1 is not enough to
+				 * guarantee a complete VMENTER.
+				 *
+				 * Dirty ring does not need this since
+				 * dirty_ring_last_page would already help
+				 * avoid it.
+				 */
+				current_loop = guest_vcpu_loops;
+
+				/*
+				 * Wait until the vcpu thread at least
+				 * completes one VMENTER again.  the usleep()
+				 * gives the vcpu thread a better chance to be
+				 * scheduled earlier.
+				 */
+				while (guest_vcpu_loops <= current_loop + 2)
+					usleep(1);
+				/* Recalculate */
+				matched = (*value_ptr == iteration ||
+					   *value_ptr == iteration - 1);
+				break;
+			default:
+				/* Just to avoid some strict compile warning */
+				break;
 			}
 
 			TEST_ASSERT(matched,
This fixes a bug that can trigger with e.g. "taskset -c 0 ./dirty_log_test" or
when the testing host is very busy.

The issue is when the vcpu thread got the dirty bit set but got preempted by
other threads _before_ the data is written, we won't be able to see the latest
data only until the vcpu threads do VMENTER. IOW, the guest write operation and
dirty bit set cannot guarantee atomicity. The race could look like:

  main thread                         vcpu thread
  ===========                         ===========
  iteration=X
                                      *addr = X
                                      (so latest data is X)
  iteration=X+1
  ...
  iteration=X+N
                                      guest executes "*addr = X+N"
                                      reg=READ_ONCE(iteration)=X+N
                                      host page fault
                                      set dirty bit for page "addr"
                                      (_before_ VMENTER happens...
                                       so *addr is still X!)
                                      vcpu thread got preempted
  get dirty log
  verify data
  detected dirty bit set, data is X
  not X+N nor X+N-1, data too old!

This patch closes this race by allowing the main thread to give the vcpu thread
chance to do a VMENTER to complete that write operation. It's done by adding a
vcpu loop counter (must be defined as volatile as main thread will do read
loop), then the main thread can guarantee the vcpu got at least another VMENTER
by making sure the guest_vcpu_loops increases by 2.

Dirty ring does not need this since dirty_ring_last_page would already help
avoid this specific race condition.

Cc: Andrew Jones <drjones@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v2:
- drop one unnecessary check on "!matched"
---
 tools/testing/selftests/kvm/dirty_log_test.c | 53 +++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)