Message ID | 20210413213641.23742-1-peterx@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v2] kvm/selftests: Fix race condition with dirty_log_test
On Tue, Apr 13, 2021 at 05:36:41PM -0400, Peter Xu wrote:
> This fixes a bug that can trigger with e.g. "taskset -c 0 ./dirty_log_test" or
> when the testing host is very busy.
>
> The issue is when the vcpu thread got the dirty bit set but got preempted by
> other threads _before_ the data is written, we won't be able to see the latest
> data only until the vcpu threads do VMENTER. IOW, the guest write operation and
> dirty bit set cannot guarantee atomicity. The race could look like:
>
>   main thread                         vcpu thread
>   ===========                         ===========
>   iteration=X
>                                       *addr = X
>                                       (so latest data is X)
>   iteration=X+1
>   ...
>   iteration=X+N
>                                       guest executes "*addr = X+N"
>                                       reg=READ_ONCE(iteration)=X+N
>                                       host page fault
>                                       set dirty bit for page "addr"
>                                       (_before_ VMENTER happens...
>                                        so *addr is still X!)
>                                       vcpu thread got preempted
>   get dirty log
>   verify data
>   detected dirty bit set, data is X
>   not X+N nor X+N-1, data too old!
>
> This patch closes this race by allowing the main thread to give the vcpu thread
> chance to do a VMENTER to complete that write operation. It's done by adding a
> vcpu loop counter (must be defined as volatile as main thread will do read
> loop), then the main thread can guarantee the vcpu got at least another VMENTER
> by making sure the guest_vcpu_loops increases by 2.
>
> Dirty ring does not need this since dirty_ring_last_page would already help
> avoid this specific race condition.
>
> Cc: Andrew Jones <drjones@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v2:
> - drop one unnecessary check on "!matched"
> ---
>  tools/testing/selftests/kvm/dirty_log_test.c | 53 +++++++++++++++++++-
>  1 file changed, 52 insertions(+), 1 deletion(-)

Reviewed-by: Andrew Jones <drjones@redhat.com>
On 13/04/21 23:36, Peter Xu wrote:
> This patch closes this race by allowing the main thread to give the vcpu thread
> chance to do a VMENTER to complete that write operation. It's done by adding a
> vcpu loop counter (must be defined as volatile as main thread will do read
> loop), then the main thread can guarantee the vcpu got at least another VMENTER
> by making sure the guest_vcpu_loops increases by 2.
>
> Dirty ring does not need this since dirty_ring_last_page would already help
> avoid this specific race condition.

Just a nit, the comment and commit message should mention KVM_RUN rather than
vmentry; it's possible to be preempted many times in vcpu_enter_guest without
making progress, but those wouldn't return to userspace and thus would not
update guest_vcpu_loops.

Also, volatile is considered harmful even in userspace/test code[1].
Technically rather than volatile one should use an atomic load (even a relaxed
one), but in practice it's okay to use volatile too *for this specific use*
(READ_ONCE/WRITE_ONCE are volatile reads and writes as well).  If the selftests
gained 32-bit support, one should not use volatile because neither reads nor
writes to uint64_t variables would be guaranteed to be atomic.

Queued, thanks.

Paolo

[1] Documentation/process/volatile-considered-harmful.rst
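For illustration only, a minimal sketch of what the counter could look like
with C11 relaxed atomics instead of volatile, as suggested above.  The variable
name guest_vcpu_loops comes from the patch; the helpers vcpu_loop_inc() and
vcpu_loop_read() are hypothetical and not part of any posted version:

#include <stdatomic.h>
#include <stdint.h>

/* Counter shared between the vcpu thread (writer) and the main thread (reader). */
static _Atomic uint64_t guest_vcpu_loops;

/* vcpu thread: bump the counter once per completed KVM_RUN. */
static inline void vcpu_loop_inc(void)
{
	atomic_fetch_add_explicit(&guest_vcpu_loops, 1, memory_order_relaxed);
}

/* main thread: tear-free read of the counter; relaxed ordering is enough here. */
static inline uint64_t vcpu_loop_read(void)
{
	return atomic_load_explicit(&guest_vcpu_loops, memory_order_relaxed);
}

Unlike volatile, the atomic accesses stay tear-free even if the counter is
wider than the machine word, which is the 32-bit concern raised above.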
Paolo,

Please hold on with this patch, since I got another report that it can still
trigger a test failure under an even heavier workload (e.g., "taskset -c 0
./dirty_log_test" plus one or multiple "taskset -c 0 while :; do :; done").
So I plan to do it in another way.  I tested that longer yesterday but haven't
updated this patch yet.  More below.

On Sat, Apr 17, 2021 at 02:59:48PM +0200, Paolo Bonzini wrote:
> On 13/04/21 23:36, Peter Xu wrote:
> > This patch closes this race by allowing the main thread to give the vcpu thread
> > chance to do a VMENTER to complete that write operation. It's done by adding a
> > vcpu loop counter (must be defined as volatile as main thread will do read
> > loop), then the main thread can guarantee the vcpu got at least another VMENTER
> > by making sure the guest_vcpu_loops increases by 2.
> >
> > Dirty ring does not need this since dirty_ring_last_page would already help
> > avoid this specific race condition.
>
> Just a nit, the comment and commit message should mention KVM_RUN rather
> than vmentry; it's possible to be preempted many times in vcpu_enter_guest
> without making progress, but those wouldn't return to userspace and thus
> would not update guest_vcpu_loops.

But what I really wanted to emphasize is the vmentry point rather than KVM_RUN:
KVM_RUN can return without a vmentry, while the vmentry is exactly the point
where the data will be flushed.

> Also, volatile is considered harmful even in userspace/test code[1].
> Technically rather than volatile one should use an atomic load (even a
> relaxed one), but in practice it's okay to use volatile too *for this
> specific use* (READ_ONCE/WRITE_ONCE are volatile reads and writes as well).
> If the selftests gained 32-bit support, one should not use volatile because
> neither reads nor writes to uint64_t variables would be guaranteed to be
> atomic.

Indeed!  I'll start to use atomics.

Regarding why this patch won't really solve all race conditions...

The problem is, I think, that one guest memory write operation (of this
specific test) contains a few micro-steps when the page is under kvm dirty
tracking (here I'm only considering write-protect rather than pml, but pml
should be similar, at least when the log buffer is full):

  (1) Guest reads the 'iteration' number into a register, prepares to write,
      page fault
  (2) Set dirty bit in either the dirty bitmap or the dirty ring
  (3) Return to guest, data written

When we verify the data, we assumed that all these steps are "atomic", say,
when (1) happened for this page, we assume (2) & (3) must have happened.  We
had some tricks to work around the "un-atomicity" of the above three steps,
and this patch wanted to fix the atomicity of steps (2)+(3) by explicitly
letting the main thread wait for at least one vmenter of the vcpu thread,
which should work.  However, what I overlooked is that we still have a race
when (1) and (2) can be interrupted.
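For reference, a simplified sketch of the guest-side write loop with the three
micro-steps above marked as comments.  This only approximates guest_code() in
dirty_log_test.c (the real code also randomizes and aligns the target page) and
relies on the test's existing globals and macros (guest_test_virt_mem,
guest_num_pages, guest_page_size, iteration, READ_ONCE, GUEST_SYNC), so details
may differ from the actual file:

/* Simplified, approximate rendering of dirty_log_test.c's guest_code(). */
static void guest_code(void)
{
	uint64_t addr;
	int i;

	while (true) {
		for (i = 0; i < TEST_PAGES_PER_LOOP; i++) {
			addr = guest_test_virt_mem +
			       (i % guest_num_pages) * guest_page_size;

			/*
			 * Step (1): "iteration" is read into a register; the
			 * store below faults because the page is
			 * write-protected for dirty logging.
			 * Step (2): the host fault handler sets the dirty bit
			 * (bitmap or ring).
			 * Step (3): only after the next vmenter does the store
			 * retire and the new value land in memory.
			 */
			*(uint64_t *)addr = READ_ONCE(iteration);
		}

		/* Tell the host one round of writes is done. */
		GUEST_SYNC(1);
	}
}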
As an example of how steps (1) and (2) can get interrupted, I simply tried to
trace kvm_vcpu_mark_page_dirty() and dump the stack for vmexit cases; we can
see at least a bunch of cases where the vcpu can be scheduled out even before
setting the dirty bit:

@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    kmem_cache_alloc+583
    kvm_mmu_topup_memory_cache+33
    direct_page_fault+237
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    down_read+14
    get_user_pages_unlocked+90
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 23
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __alloc_pages+663
    alloc_pages_vma+128
    wp_page_copy+773
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 1406
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    hva_to_pfn+157
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 2579
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    mmu_notifier_invalidate_range_start+9
    wp_page_copy+296
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 3309
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __get_user_pages+530
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4499

It means it can always happen that the vcpu reads a very old "iteration" value
in step (1) and doesn't set the dirty bit (step (2)) or write it to memory
(step (3)) until, say, one year later.. :)  Then the verification won't pass,
since the main thread's iteration is already much newer, and the main thread
shouts at us.

So far I don't see an easy way to guarantee atomicity of all steps 1-3 (as
this patch only achieved it for steps 2-3) other than syncing at the
GUEST_SYNC() point of the guest code when we do verification of the dirty
bits.  Drew mentioned something like this previously in the bugzilla; I wanted
to give it a shot with a lighter sync, but it seems not to work.

Paolo, Drew, Sean - feel free to shoot if any of you have a better idea.

As I mentioned, I tested v2 of this patch and so far no issue found.  I'll
post it later today, so maybe we can continue the discussion there too (btw, I
also found another signal race there, so I'll post a series with 2 patches).

Thanks,
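For illustration only, one possible shape of the GUEST_SYNC()-based
synchronization mentioned above: the main thread parks the vcpu thread at a
sync point (after which all pending guest writes have reached memory), verifies
the dirty data, then releases it.  This is a sketch under assumptions, not the
series Peter eventually posted; the semaphore and flag names are hypothetical:

#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names: a stop/continue handshake between the two threads. */
static sem_t sem_vcpu_stop;		/* posted by vcpu thread when parked   */
static sem_t sem_vcpu_cont;		/* posted by main thread to release it */
static atomic_bool vcpu_sync_stop_requested;

static void vcpu_sync_init(void)
{
	sem_init(&sem_vcpu_stop, 0, 0);
	sem_init(&sem_vcpu_cont, 0, 0);
}

/* vcpu thread: call right after the guest exits with GUEST_SYNC(). */
static void vcpu_handle_sync_stop(void)
{
	if (atomic_exchange(&vcpu_sync_stop_requested, false)) {
		sem_post(&sem_vcpu_stop);	/* parked; guest writes have landed */
		sem_wait(&sem_vcpu_cont);	/* wait for verification to finish  */
	}
}

/* main thread: park the vcpu at a sync point, verify dirty data, resume. */
static void with_vcpu_stopped(void (*verify)(void))
{
	atomic_store(&vcpu_sync_stop_requested, true);
	sem_wait(&sem_vcpu_stop);
	verify();
	sem_post(&sem_vcpu_cont);
}

Because the vcpu thread is parked only at GUEST_SYNC(), steps (1)-(3) of a
write can never be split across a verification pass, which is the property the
counter-based approach could not guarantee.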
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index bb2752d78fe3..b639f088bb86 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -160,6 +160,10 @@ static bool dirty_ring_vcpu_ring_full;
  * verifying process, we let it pass.
  */
 static uint64_t dirty_ring_last_page;
+/*
+ * Record how many loops the guest vcpu thread has went through
+ */
+static volatile uint64_t guest_vcpu_loops;
 
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -526,6 +530,7 @@ static void *vcpu_worker(void *data)
 			assert(sig == SIG_IPI);
 		}
 		log_mode_after_vcpu_run(vm, ret, errno);
+		guest_vcpu_loops++;
 	}
 
 	pr_info("Dirtied %"PRIu64" pages\n", pages_count);
@@ -553,6 +558,7 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 		}
 
 		if (test_and_clear_bit_le(page, bmap)) {
+			uint64_t current_loop;
 			bool matched;
 
 			host_dirty_count++;
@@ -565,7 +571,12 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 			matched = (*value_ptr == iteration ||
 				   *value_ptr == iteration - 1);
 
-			if (host_log_mode == LOG_MODE_DIRTY_RING && !matched) {
+			if (matched)
+				continue;
+
+			/* Didn't match.. Let's figure out what's wrong.. */
+			switch (host_log_mode) {
+			case LOG_MODE_DIRTY_RING:
 				if (*value_ptr == iteration - 2 && min_iter <= iteration - 2) {
 					/*
 					 * Short answer: this case is special
@@ -608,6 +619,46 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 					 */
 					continue;
 				}
+				break;
+			case LOG_MODE_DIRTY_LOG:
+			case LOG_MODE_CLEAR_LOG:
+				/*
+				 * This fixes a bug that can trigger with
+				 * e.g. "taskset -c 0 ./dirty_log_test" or when
+				 * the testing host is very busy, when the vcpu
+				 * thread got the dirty bit set but got
+				 * preempted by other threads _before_ the data
+				 * is written, so we won't be able to see the
+				 * latest data only until the vcpu threads do
+				 * VMENTER and the write finally lands to the
+				 * memory.  So when !matched happened, we give
+				 * the vcpu thread _one_ more chance to do a
+				 * VMENTER so as to flush the written data.  We
+				 * do that by observing guest_vcpu_loops to
+				 * increase +2: as +1 is not enough to
+				 * guarantee a complete VMENTER.
+				 *
+				 * Dirty ring does not need this since
+				 * dirty_ring_last_page would already help
+				 * avoid it.
+				 */
+				current_loop = guest_vcpu_loops;
+
+				/*
+				 * Wait until the vcpu thread at least
+				 * completes one VMENTER again.  the usleep()
+				 * gives the vcpu thread a better chance to be
+				 * scheduled earlier.
+				 */
+				while (guest_vcpu_loops <= current_loop + 2)
+					usleep(1);
+				/* Recalculate */
+				matched = (*value_ptr == iteration ||
+					   *value_ptr == iteration - 1);
+				break;
+			default:
+				/* Just to avoid some strict compile warning */
+				break;
 			}
 
 			TEST_ASSERT(matched,
This fixes a bug that can trigger with e.g. "taskset -c 0 ./dirty_log_test" or
when the testing host is very busy.

The issue is when the vcpu thread got the dirty bit set but got preempted by
other threads _before_ the data is written, we won't be able to see the latest
data only until the vcpu threads do VMENTER. IOW, the guest write operation and
dirty bit set cannot guarantee atomicity. The race could look like:

  main thread                         vcpu thread
  ===========                         ===========
  iteration=X
                                      *addr = X
                                      (so latest data is X)
  iteration=X+1
  ...
  iteration=X+N
                                      guest executes "*addr = X+N"
                                      reg=READ_ONCE(iteration)=X+N
                                      host page fault
                                      set dirty bit for page "addr"
                                      (_before_ VMENTER happens...
                                       so *addr is still X!)
                                      vcpu thread got preempted
  get dirty log
  verify data
  detected dirty bit set, data is X
  not X+N nor X+N-1, data too old!

This patch closes this race by allowing the main thread to give the vcpu thread
chance to do a VMENTER to complete that write operation. It's done by adding a
vcpu loop counter (must be defined as volatile as main thread will do read
loop), then the main thread can guarantee the vcpu got at least another VMENTER
by making sure the guest_vcpu_loops increases by 2.

Dirty ring does not need this since dirty_ring_last_page would already help
avoid this specific race condition.

Cc: Andrew Jones <drjones@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v2:
- drop one unnecessary check on "!matched"
---
 tools/testing/selftests/kvm/dirty_log_test.c | 53 +++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)