
KVM: selftests: Wait mprotect_ro_done before write to RO in mmu_stress_test

Message ID 20250208105318.16861-1-yan.y.zhao@intel.com (mailing list archive)
State New, archived
Series KVM: selftests: Wait mprotect_ro_done before write to RO in mmu_stress_test

Commit Message

Yan Zhao Feb. 8, 2025, 10:53 a.m. UTC
In the read-only mprotect() phase of mmu_stress_test, ensure that
mprotect(PROT_READ) has completed before the guest starts writing to the
read-only mprotect() memory.

Without waiting for mprotect_ro_done before the guest starts writing in
stage 3 (the stage for read-only mprotect()), the host's assertion of stage
3 could fail if mprotect_ro_done is set to true in the window between the
guest finishing writes to all GPAs and executing GUEST_SYNC(3).

This scenario occurs easily, especially when there are hundreds of vCPUs.

CPU 0                  CPU 1 guest     CPU 1 host
                                       enter stage 3's 1st loop
                       //in stage 3
                       write all GPAs
                       @rip 0x4025f0

mprotect(PROT_READ)
mprotect_ro_done=true
                       GUEST_SYNC(3)
                                       r=0, continue stage 3's 1st loop

                       //in stage 4
                       write GPA
                       @rip 0x402635

                                       -EFAULT, jump out stage 3's 1st loop
                                       enter stage 3's 2nd loop
                       write GPA
                       @rip 0x402635
                                       -EFAULT, continue stage 3's 2nd loop
                                       guest rip += 3

The test then fails and reports "Unhandled exception '0xe' at guest RIP
'0x402638'", because the "(mem) = val" store in vcpu_arch_put_guest() is
compiled into a mov instruction of length 4, so the next valid guest RIP
is 0x402639, not 0x402638.

Even if that store were compiled into a mov instruction of length 3, the
subsequent execution of GUEST_SYNC(4) in the guest could still make the
host's stage 3 assertion fail.
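
For reference, the guest-side code for stages 3 and 4 in mmu_stress_test.c,
which the RIPs above refer to, is roughly as follows (comments and the arch
guards around GUEST_SYNC(3) trimmed); the hand-coded mov is the write at rip
0x4025f0 and the vcpu_arch_put_guest() store is the one at rip 0x402635:

	do {
		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
#ifdef __x86_64__
			/* hand-coded 3-byte store: mov %rax, (%rax) */
			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory");
#elif defined(__aarch64__)
			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
#else
			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
#endif
	} while (!READ_ONCE(mprotect_ro_done));

	GUEST_SYNC(3);

	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
	GUEST_SYNC(4);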

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 25 +++++++++++--------
 1 file changed, 14 insertions(+), 11 deletions(-)

Comments

Edgecombe, Rick P Feb. 11, 2025, 1:03 a.m. UTC | #1
On Sat, 2025-02-08 at 18:53 +0800, Yan Zhao wrote:
> In the read-only mprotect() phase of mmu_stress_test, ensure that
> mprotect(PROT_READ) has completed before the guest starts writing to the
> read-only mprotect() memory.

Is this a fix for the intermittent failure we saw on the v6.13-rc3 based kvm
branch? Funnily, I can't seem to reproduce it anymore, with or without this fix.

On the fix though, doesn't this remove the coverage of writing to a region that
is in the process of being made RO? I'm thinking about warnings, etc that may
trigger intermittently based on bugs with a race component. I don't know if we
could fix the test and still leave the write while the "mprotect(PROT_READ) is
underway". It seems to be deliberate.
Yan Zhao Feb. 11, 2025, 1:42 a.m. UTC | #2
On Tue, Feb 11, 2025 at 09:03:08AM +0800, Edgecombe, Rick P wrote:
> On Sat, 2025-02-08 at 18:53 +0800, Yan Zhao wrote:
> > In the read-only mprotect() phase of mmu_stress_test, ensure that
> > mprotect(PROT_READ) has completed before the guest starts writing to the
> > read-only mprotect() memory.
> 
> Is this a fix for the intermittent failure we saw on the v6.13-rc3 based kvm
> branch? Funnily, I can't seem to reproduce it anymore, with or without this fix.
Hmm, it can be reproduced on my SPR (non-TDX) machine almost every time.
It depends on the timing of when mprotect(PROT_READ) completes.

The detailed error log from my machine is attached at the bottom.
 
> On the fix though, doesn't this remove the coverage of writing to a region that
> is in the process of being made RO? I'm thinking about warnings, etc that may
> trigger intermittently based on bugs with a race component. I don't know if we
> could fix the test and still leave the write while the "mprotect(PROT_READ) is
> underway". It seems to be deliberate.
Write before "mprotect(PROT_READ)" has been tested in stage 0.
Not sure it's deliberate to test write in the process of being made RO.
If it is, maybe we could make the fix by writing to RO memory a second time
after mprotect_ro_done is true:

--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -47,17 +47,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         * is low in this case).  For x86, hand-code the exact opcode so that
         * there is no room for variability in the generated instruction.
         */
-       do {
-               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+       for (i = 0; i < 2; i++) {
+               do {
+                       for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
+                               asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
                        asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
                        vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-       } while (!READ_ONCE(mprotect_ro_done));
-
+               } while (!READ_ONCE(mprotect_ro_done));
+       }
        /*
         * Only architectures that write the entire range can explicitly sync,
         * as other architectures will be stuck on the write fault.


Orig error log:
[root@984fee00a151 kvm]# ./mmu_stress_test
Random seed: 0x6b8b4567
Running with 128gb of guest memory and 168 vCPUs
Waiting for vCPUs to finish spawning...
All vCPUs finished spawning, releasing...
Waiting for vCPUs to finish run 1...
All vCPUs finished run 1, releasing...
Waiting for vCPUs to finish reset...
All vCPUs finished reset, releasing...
Waiting for vCPUs to finish run 2...
All vCPUs finished run 2, releasing...
Waiting for vCPUs to finish mprotect RO...
168 vCPUs haven't rendezvoused...==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441760 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441782 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441876 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441825 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441905 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441780 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441888 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441834 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441832 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441880 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441792 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441787 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441804 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441784 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441781 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441868 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441902 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441806 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441842 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441844 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441848 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441860 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441766 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441786 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441794 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441763 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441850 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441772 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441818 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441877 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441846 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441768 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441754 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441757 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441800 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441833 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441758 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441841 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441853 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441909 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441862 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441771 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441883 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441799 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441847 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441855 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441793 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441755 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441797 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441805 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441835 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441861 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441863 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441815 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441893 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441776 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441871 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441864 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441899 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441845 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441778 errno=4 - Interrupted system call
==== Test Assertion Failure ====
  lib/x86/processor.c:582: Unconditional guest failure
  pid=441740 tid=441897 errno=4 - Interrupted system call
168 vCPUs haven't rendezvoused...     1 0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
168 vCPUs haven't rendezvoused...    Unhandled exception '0xe' at guest RIP '0x402638'Unhandled exception '0xe' at guest RIP '0x402638'     1   0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
168 vCPUs haven't rendezvoused...     1 0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
  Unhandled exception '0xe' at guest RIP '0x402638'
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: assert_on_unhandled_exception at processor.c:625
addr2line: '/proc/441740/exe': No such file
     2  0x000000000040552d: _vcpu_run at kvm_util.c:1652
     3  0x0000000000402a10: vcpu_worker at mmu_stress_test.c:168
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: ?? ??:0
     2  0x000000000040552d: ?? ??:0
     3  0x0000000000402a10: ?? ??:0
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: ?? ??:0
     2  0x000000000040552d: ?? ??:0
     3  0x0000000000402a10: ?? ??:0
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: ?? ??:0
     2  0x000000000040552d: ?? ??:0
     3  0x0000000000402a10: ?? ??:0
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: ?? ??:0
     2  0x000000000040552d: ?? ??:0
     3  0x0000000000402a10: ?? ??:0
     4  0x00007f67fec08179: ?? ??:0
     5  0x00007f67fe8fcdc2: ?? ??:0
     1  0x000000000040fd01: ?? ??:0
     2  0x000000000040552d: ?? ??:0
     3  0x0000000000402a10: ?? ??:0
     4  0x00007f67fec08179: ?? ??:0
Edgecombe, Rick P Feb. 11, 2025, 6:29 p.m. UTC | #3
On Tue, 2025-02-11 at 09:42 +0800, Yan Zhao wrote:
> > Is this a fix for the intermittent failure we saw on the v6.13-rc3 based kvm
> > branch? Funnily, I can't seem to reproduce it anymore, with or without this
> > fix.
> Hmm, it can be reproduced in my SPR (non TDX) almost every time.
> It depends on the timing when mprotect(PROT_READ) is completed done.
> 
> Attached the detailed error log in my machine at the bottom.

I must be getting lucky on timing. BTW, in the above I actually meant on either
the new or old *kernel*.

>  
> > On the fix though, doesn't this remove the coverage of writing to a region
> > that
> > is in the process of being made RO? I'm thinking about warnings, etc that
> > may
> > trigger intermittently based on bugs with a race component. I don't know if
> > we
> > could fix the test and still leave the write while the "mprotect(PROT_READ)
> > is
> > underway". It seems to be deliberate.
> Write before "mprotect(PROT_READ)" has been tested in stage 0.
> Not sure it's deliberate to test write in the process of being made RO.
> If it is, maybe we could make the fix by writing to RO memory a second time
> after mprotect_ro_done is true:

That could work if it's desirable to maintain the testing. I would mention the
reduced scope in the log at least. Maybe Sean will chime in.

Also, I think it needs:

Fixes: b6c304aec648 ("KVM: selftests: Verify KVM correctly handles
mprotect(PROT_READ)")
Yan Zhao Feb. 12, 2025, 6:59 a.m. UTC | #4
On Wed, Feb 12, 2025 at 02:29:26AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-02-11 at 09:42 +0800, Yan Zhao wrote:
> > > Is this a fix for the intermittent failure we saw on the v6.13-rc3 based kvm
> > > branch? Funnily, I can't seem to reproduce it anymore, with or without this
> > > fix.
> > Hmm, it can be reproduced in my SPR (non TDX) almost every time.
> > It depends on the timing when mprotect(PROT_READ) is completed done.
> > 
> > Attached the detailed error log in my machine at the bottom.
> 
> I must be getting lucky on timing. BTW, in the above I actually meant on either
It's also not easy for me to reproduce it on my TDX platform.
While the reproduction rate is around 100% on my non-TDX SPR machine, if I add
a minor delay in the guest (e.g. printing the value of mprotect_ro_done at the
beginning of stage 3), the test passes as well.
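
E.g., a minimal sketch of the kind of delay I mean (hypothetical, using the
selftests' GUEST_PRINTF ucall helper), placed right before the stage-3 write
loop in guest_code():

	/* hypothetical debug delay: print mprotect_ro_done before stage 3's writes */
	GUEST_PRINTF("mprotect_ro_done = %u\n", READ_ONCE(mprotect_ro_done));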

> the new or old *kernel*.
Hmm, the test fails on my platform as soon as b6c304aec648 ("KVM: selftests:
Verify KVM correctly handles mprotect(PROT_READ)") is introduced, which is in
6.13.0-rc2.
 
> >  
> > > On the fix though, doesn't this remove the coverage of writing to a region
> > > that
> > > is in the process of being made RO? I'm thinking about warnings, etc that
> > > may
> > > trigger intermittently based on bugs with a race component. I don't know if
> > > we
> > > could fix the test and still leave the write while the "mprotect(PROT_READ)
> > > is
> > > underway". It seems to be deliberate.
> > Write before "mprotect(PROT_READ)" has been tested in stage 0.
> > Not sure it's deliberate to test write in the process of being made RO.
> > If it is, maybe we could make the fix by writing to RO memory a second time
> > after mprotect_ro_done is true:
> 
> That could work if it's desirable to maintain the testing. I would mention the
> reduced scope in the log at least. Maybe Sean will chime in.
Not really a reduced scope.

Even before this patch, the memory writes are not guaranteed to occur during
the mprotect(PROT_READ) transition, i.e. there are 3 possibilities:
1. all writes occur before mprotect(PROT_READ) takes effect.
2. all writes occur after mprotect(PROT_READ) takes effect.
3. some writes occur before mprotect(PROT_READ) takes effect and some occur
   after.

Case 3 is not guaranteed without introducing another synchronization flag.

That said, I'm not sure whether issuing writes before mprotect_ro_done reads as
true is really necessary.

But I'm fine with either way:
1. make sure all writes happen after mprotect_ro_done is true (as in this patch).
2. run the do-while loop twice so that some writes are guaranteed to happen
   after mprotect_ro_done is true (as in the diff earlier in this thread).

> Also, I think it needs:
> 
> Fixes: b6c304aec648 ("KVM: selftests: Verify KVM correctly handles
> mprotect(PROT_READ)")
Will add it in v2!
Sean Christopherson Feb. 26, 2025, 1:48 a.m. UTC | #5
On Sat, Feb 08, 2025, Yan Zhao wrote:
> In the read-only mprotect() phase of mmu_stress_test, ensure that
> mprotect(PROT_READ) has completed before the guest starts writing to the
> read-only mprotect() memory.
> 
> Without waiting for mprotect_ro_done before the guest starts writing in
> stage 3 (the stage for read-only mprotect()), the host's assertion of stage
> 3 could fail if mprotect_ro_done is set to true in the window between the
> guest finishing writes to all GPAs and executing GUEST_SYNC(3).
> 
> This scenario is easy to occur especially when there are hundred of vCPUs.
> 
> CPU 0                  CPU 1 guest     CPU 1 host
>                                        enter stage 3's 1st loop
>                        //in stage 3
>                        write all GPAs
>                        @rip 0x4025f0
> 
> mprotect(PROT_READ)
> mprotect_ro_done=true
>                        GUEST_SYNC(3)
>                                        r=0, continue stage 3's 1st loop
> 
>                        //in stage 4
>                        write GPA
>                        @rip 0x402635
> 
>                                        -EFAULT, jump out stage 3's 1st loop
>                                        enter stage 3's 2nd loop
>                        write GPA
>                        @rip 0x402635
>                                        -EFAULT, continue stage 3's 2nd loop
>                                        guest rip += 3
> 
> The test then fails and reports "Unhandled exception '0xe' at guest RIP
> '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> of length 4.

This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
defined?

	do {
		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
#ifdef __x86_64__
			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
#elif defined(__aarch64__)
			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
#else
			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
#endif
	} while (!READ_ONCE(mprotect_ro_done));

	/*
	 * Only architectures that write the entire range can explicitly sync,
	 * as other architectures will be stuck on the write fault.
	 */
#if defined(__x86_64__) || defined(__aarch64__)
	GUEST_SYNC(3);
#endif

	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
	GUEST_SYNC(4);



> Even if it could be compiled into a mov instruction of length 3, the
> following execution of GUEST_SYNC(4) in guest could still cause the host
> failure of the assertion of stage 3.

Sorry, I don't follow.  The guest doesn't get "released" from GUEST_SYNC(3) until
the host runs the vCPU again, and that happens after asserting on stage 3 and
synchronizing with the main thread.

	assert_sync_stage(vcpu, 3);
#endif /* __x86_64__ || __aarch64__ */
	rendezvous_with_boss();

	/*
	 * Stage 4.  Run to completion, waiting for mprotect(PROT_WRITE) to
	 * make the memory writable again.
	 */
	do {
		r = _vcpu_run(vcpu);
	} while (r && errno == EFAULT);
Sean Christopherson Feb. 26, 2025, 1:53 a.m. UTC | #6
On Tue, Feb 11, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-02-11 at 09:42 +0800, Yan Zhao wrote:
> > > On the fix though, doesn't this remove the coverage of writing to a
> > > region that is in the process of being made RO? I'm thinking about
> > > warnings, etc that may trigger intermittently based on bugs with a race
> > > component. I don't know if we could fix the test and still leave the
> > > write while the "mprotect(PROT_READ) is underway". It seems to be
> > > deliberate.
> > Write before "mprotect(PROT_READ)" has been tested in stage 0.
> > Not sure it's deliberate to test write in the process of being made RO.

Writing while VMAs are being made RO is 100% intended.  The goal is to stress
KVM's interactions with the mmu_notifier, and to verify KVM delivers -EFAULT to
userspace.

Something isn't quite right in the original analysis.  We need to drill down on
that before changing anything.

FWIW, I run this test frequently on large systems and have never observed failures.
Maybe Rick and I should go buy lottery tickets?
Yan Zhao Feb. 26, 2025, 2:07 a.m. UTC | #7
On Tue, Feb 25, 2025 at 05:53:49PM -0800, Sean Christopherson wrote:
> On Tue, Feb 11, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-02-11 at 09:42 +0800, Yan Zhao wrote:
> > > > On the fix though, doesn't this remove the coverage of writing to a
> > > > region that is in the process of being made RO? I'm thinking about
> > > > warnings, etc that may trigger intermittently based on bugs with a race
> > > > component. I don't know if we could fix the test and still leave the
> > > > write while the "mprotect(PROT_READ) is underway". It seems to be
> > > > deliberate.
> > > Write before "mprotect(PROT_READ)" has been tested in stage 0.
> > > Not sure it's deliberate to test write in the process of being made RO.
> 
> Writing while VMAs are being made RO is 100% intended.  The goal is to stress
> KVM's interactions with the mmu_notifier, and to verify KVM delivers -EFAULT to
> userspace.
> 
> Something isn't quite right in the original analysis.  We need to drill down on
> that before change anything.
> 
> FWIW, I run this test frequently on large systems and have never observed failures.
Could you try adding CONFIG_LOCK_STAT=y?

With this config, the failure rate is more than 90% in my SPR non-TDX machine,
and 20%~80% in my TDX machine.

> Maybe Rick and I should go buy lottery tickets?
Sean Christopherson Feb. 26, 2025, 2:20 a.m. UTC | #8
On Wed, Feb 26, 2025, Yan Zhao wrote:
> On Tue, Feb 25, 2025 at 05:53:49PM -0800, Sean Christopherson wrote:
> > On Tue, Feb 11, 2025, Rick P Edgecombe wrote:
> > > On Tue, 2025-02-11 at 09:42 +0800, Yan Zhao wrote:
> > > > > On the fix though, doesn't this remove the coverage of writing to a
> > > > > region that is in the process of being made RO? I'm thinking about
> > > > > warnings, etc that may trigger intermittently based on bugs with a race
> > > > > component. I don't know if we could fix the test and still leave the
> > > > > write while the "mprotect(PROT_READ) is underway". It seems to be
> > > > > deliberate.
> > > > Write before "mprotect(PROT_READ)" has been tested in stage 0.
> > > > Not sure it's deliberate to test write in the process of being made RO.
> > 
> > Writing while VMAs are being made RO is 100% intended.  The goal is to stress
> > KVM's interactions with the mmu_notifier, and to verify KVM delivers -EFAULT to
> > userspace.
> > 
> > Something isn't quite right in the original analysis.  We need to drill down on
> > that before change anything.
> > 
> > FWIW, I run this test frequently on large systems and have never observed failures.
> Could you try adding CONFIG_LOCK_STAT=y?

Will do, though it'll probably be a few days before I can take a look.

> With this config, the failure rate is more than 90% in my SPR non-TDX machine,
> and 20%~80% in my TDX machine.
Yan Zhao Feb. 26, 2025, 3:08 a.m. UTC | #9
On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> On Sat, Feb 08, 2025, Yan Zhao wrote:
> > In the read-only mprotect() phase of mmu_stress_test, ensure that
> > mprotect(PROT_READ) has completed before the guest starts writing to the
> > read-only mprotect() memory.
> > 
> > Without waiting for mprotect_ro_done before the guest starts writing in
> > stage 3 (the stage for read-only mprotect()), the host's assertion of stage
> > 3 could fail if mprotect_ro_done is set to true in the window between the
> > guest finishing writes to all GPAs and executing GUEST_SYNC(3).
> > 
> > This scenario is easy to occur especially when there are hundred of vCPUs.
> > 
> > CPU 0                  CPU 1 guest     CPU 1 host
> >                                        enter stage 3's 1st loop
> >                        //in stage 3
> >                        write all GPAs
> >                        @rip 0x4025f0
> > 
> > mprotect(PROT_READ)
> > mprotect_ro_done=true
> >                        GUEST_SYNC(3)
> >                                        r=0, continue stage 3's 1st loop
> > 
> >                        //in stage 4
> >                        write GPA
> >                        @rip 0x402635
> > 
> >                                        -EFAULT, jump out stage 3's 1st loop
> >                                        enter stage 3's 2nd loop
> >                        write GPA
> >                        @rip 0x402635
> >                                        -EFAULT, continue stage 3's 2nd loop
> >                                        guest rip += 3
> > 
> > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > of length 4.
> 
> This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> defined?
Stage 3 is the hand-coded "mov %rax, (%rax)", but stage 4 uses
vcpu_arch_put_guest().

The original code expects that "mov %rax, (%rax)" in stage 3 will produce
-EFAULT, so that the host thread can jump out of stage 3's 1st vcpu_run()
loop.

	/*
         * Stage 3, write guest memory and verify KVM returns -EFAULT for once
         * the mprotect(PROT_READ) lands.  Only architectures that support
         * validating *all* of guest memory sync for this stage, as vCPUs will
         * be stuck on the faulting instruction for other architectures.  Go to
         * stage 3 without a rendezvous
         */
        do {
                r = _vcpu_run(vcpu);
        } while (!r);
        TEST_ASSERT(r == -1 && errno == EFAULT,
                    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);


Then, in host stage 3's 2nd vcpu_run() loop, rip += 3 is performed to skip
"mov %rax, (%rax)" in the guest.


        for (;;) {                                                                             
                r = _vcpu_run(vcpu);                                                           
                if (!r)                                                                        
                        break;                                                   
                TEST_ASSERT_EQ(errno, EFAULT);                                   
#if defined(__x86_64__)                                                          
                WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);        
                vcpu->run->s.regs.regs.rip += 3;                                 
#elif defined(__aarch64__)                                                       
                vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),                      
                             vcpu_get_reg(vcpu, ARM64_CORE_REG(regs.pc)) + 4);   
#endif                                                                           
                                                                                 
        } 

Finally, the guest can call GUEST_SYNC(3) to let the host jump out of the 2nd
vcpu_run() loop and assert_sync_stage(vcpu, 3).


However, there's a chance that none of the "mov %rax, (%rax)" writes in stage 3
causes an -EFAULT. The guest then leaves stage 3 after finding mprotect_ro_done=true.

GUEST_SYNC(3) causes r=0, so the host stays in stage 3's first vcpu_run() loop.

Then mprotect(PROT_READ) takes effect after the guest has entered stage 4.

vcpu_arch_put_guest() in guest stage 4 produces -EFAULT and the host jumps out
of stage 3's first vcpu_run() loop.
The rip += 3 in host stage 3's second vcpu_run() loop does not match
vcpu_arch_put_guest(), producing the "Unhandled exception '0xe'" error.
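
Spelling out the RIP arithmetic from the commit message:

	stage 4 store:             rip 0x402635, instruction length 4
	next valid rip:            0x402635 + 4 = 0x402639
	host fixup does rip += 3:  0x402635 + 3 = 0x402638  (mid-instruction)

Hence the "Unhandled exception '0xe' at guest RIP '0x402638'".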


> 
> 	do {
> 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> #ifdef __x86_64__
> 			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> #elif defined(__aarch64__)
> 			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
> #else
> 			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> #endif
> 	} while (!READ_ONCE(mprotect_ro_done));
> 
> 	/*
> 	 * Only architectures that write the entire range can explicitly sync,
> 	 * as other architectures will be stuck on the write fault.
> 	 */
> #if defined(__x86_64__) || defined(__aarch64__)
> 	GUEST_SYNC(3);
> #endif
> 
> 	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> 		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> 	GUEST_SYNC(4);
> 
> 
> 
> > Even if it could be compiled into a mov instruction of length 3, the
> > following execution of GUEST_SYNC(4) in guest could still cause the host
> > failure of the assertion of stage 3.
> 
> Sorry, I don't follow.  The guest doesn't get "released" from GUEST_SYNC(3) until
> the host runs the vCPU again, and that happens after asserting on stage 3 and
> synchronizing with the main thread.


//guest stage 3 
do {
    for (...)  
    	mov %rax, (%rax)  3.1 ==> host mprotect(PROT_READ) does not yet
	                          take effect, mprotect_ro_done=false

} while (!READ_ONCE(mprotect_ro_done)); ==> 3.2 host mprotect(PROT_READ) takes
                                            effect here. mprotect_ro_done=true

GUEST_SYNC(3);                ==> host stage 3's vcpu_run() returns 0, so the host
                                  is still in stage 3's first vcpu_run() loop


//guest stage 4. With host mprotect(PROT_READ) in effect, vcpu_arch_put_guest()
  will cause -EFAULT. The host enters stage 3's second vcpu_run() loop.

for (...)
    vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa); ==> wrong for rip += 3

GUEST_SYNC(4); ==> since the host is still in stage 3, even if we change
                   vcpu_arch_put_guest() in guest stage 4 to "mov %rax, (%rax)",
                   this cannot pass the host's assert_sync_stage(vcpu, 3);

> 
> 	assert_sync_stage(vcpu, 3);
> #endif /* __x86_64__ || __aarch64__ */
> 	rendezvous_with_boss();
> 
> 	/*
> 	 * Stage 4.  Run to completion, waiting for mprotect(PROT_WRITE) to
> 	 * make the memory writable again.
> 	 */
> 	do {
> 		r = _vcpu_run(vcpu);
> 	} while (r && errno == EFAULT);
Yan Zhao Feb. 26, 2025, 3:09 a.m. UTC | #10
On Tue, Feb 25, 2025 at 06:20:02PM -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Yan Zhao wrote:
> > On Tue, Feb 25, 2025 at 05:53:49PM -0800, Sean Christopherson wrote:
> > > On Tue, Feb 11, 2025, Rick P Edgecombe wrote:
> > > > On Tue, 2025-02-11 at 09:42 +0800, Yan Zhao wrote:
> > > > > > On the fix though, doesn't this remove the coverage of writing to a
> > > > > > region that is in the process of being made RO? I'm thinking about
> > > > > > warnings, etc that may trigger intermittently based on bugs with a race
> > > > > > component. I don't know if we could fix the test and still leave the
> > > > > > write while the "mprotect(PROT_READ) is underway". It seems to be
> > > > > > deliberate.
> > > > > Write before "mprotect(PROT_READ)" has been tested in stage 0.
> > > > > Not sure it's deliberate to test write in the process of being made RO.
> > > 
> > > Writing while VMAs are being made RO is 100% intended.  The goal is to stress
> > > KVM's interactions with the mmu_notifier, and to verify KVM delivers -EFAULT to
> > > userspace.
> > > 
> > > Something isn't quite right in the original analysis.  We need to drill down on
> > > that before change anything.
> > > 
> > > FWIW, I run this test frequently on large systems and have never observed failures.
> > Could you try adding CONFIG_LOCK_STAT=y?
> 
> Will do, though it'll probably be a few days before I can take a look.
No problem.

> > With this config, the failure rate is more than 90% in my SPR non-TDX machine,
> > and 20%~80% in my TDX machine.
Sean Christopherson Feb. 26, 2025, 7:30 p.m. UTC | #11
On Wed, Feb 26, 2025, Yan Zhao wrote:
> On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> > On Sat, Feb 08, 2025, Yan Zhao wrote:
> > > In the read-only mprotect() phase of mmu_stress_test, ensure that
> > > mprotect(PROT_READ) has completed before the guest starts writing to the
> > > read-only mprotect() memory.
> > > 
> > > Without waiting for mprotect_ro_done before the guest starts writing in
> > > stage 3 (the stage for read-only mprotect()), the host's assertion of stage
> > > 3 could fail if mprotect_ro_done is set to true in the window between the
> > > guest finishing writes to all GPAs and executing GUEST_SYNC(3).
> > > 
> > > This scenario is easy to occur especially when there are hundred of vCPUs.
> > > 
> > > CPU 0                  CPU 1 guest     CPU 1 host
> > >                                        enter stage 3's 1st loop
> > >                        //in stage 3
> > >                        write all GPAs
> > >                        @rip 0x4025f0
> > > 
> > > mprotect(PROT_READ)
> > > mprotect_ro_done=true
> > >                        GUEST_SYNC(3)
> > >                                        r=0, continue stage 3's 1st loop
> > > 
> > >                        //in stage 4
> > >                        write GPA
> > >                        @rip 0x402635
> > > 
> > >                                        -EFAULT, jump out stage 3's 1st loop
> > >                                        enter stage 3's 2nd loop
> > >                        write GPA
> > >                        @rip 0x402635
> > >                                        -EFAULT, continue stage 3's 2nd loop
> > >                                        guest rip += 3
> > > 
> > > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > > of length 4.
> > 
> > This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> > vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> > defined?
> stage 3 is hand-coded "mov %rax, (%rax)", but stage 4 is with
> vcpu_arch_put_guest().
> 
> The original code expects that "mov %rax, (%rax)" in stage 3 can produce
> -EFAULT, so that in the host thread can jump out of stage 3's 1st vcpu_run()
> loop.

Ugh, I forgot that there are two loops in stage-3.  I tried to prevent this race,
but violated my own rule of not using arbitrary delays to avoid races.

Completely untested, but I think this should address the problem (I'll test
later today; you already did the hard work of debugging).  The only thing I'm
not positive is correct is making the first _vcpu_run() a one-off instead of a
loop.

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..9ac1800bb770 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -18,6 +18,7 @@
 #include "ucall_common.h"
 
 static bool mprotect_ro_done;
+static bool vcpu_hit_ro_fault;
 
 static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 {
@@ -36,9 +37,9 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 
        /*
         * Write to the region while mprotect(PROT_READ) is underway.  Keep
-        * looping until the memory is guaranteed to be read-only, otherwise
-        * vCPUs may complete their writes and advance to the next stage
-        * prematurely.
+        * looping until the memory is guaranteed to be read-only and a fault
+        * has occured, otherwise vCPUs may complete their writes and advance
+        * to the next stage prematurely.
         *
         * For architectures that support skipping the faulting instruction,
         * generate the store via inline assembly to ensure the exact length
@@ -56,7 +57,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 #else
                        vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-       } while (!READ_ONCE(mprotect_ro_done));
+       } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(vcpu_hit_ro_fault));
 
        /*
         * Only architectures that write the entire range can explicitly sync,
@@ -148,12 +149,13 @@ static void *vcpu_worker(void *data)
         * be stuck on the faulting instruction for other architectures.  Go to
         * stage 3 without a rendezvous
         */
-       do {
-               r = _vcpu_run(vcpu);
-       } while (!r);
+       r = _vcpu_run(vcpu);
        TEST_ASSERT(r == -1 && errno == EFAULT,
                    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
 
+       /* Tell the vCPU it hit a RO fault. */
+       WRITE_ONCE(vcpu_hit_ro_fault, true);
+
 #if defined(__x86_64__) || defined(__aarch64__)
        /*
         * Verify *all* writes from the guest hit EFAULT due to the VMA now
@@ -378,7 +380,6 @@ int main(int argc, char *argv[])
        rendezvous_with_vcpus(&time_run2, "run 2");
 
        mprotect(mem, slot_size, PROT_READ);
-       usleep(10);
        mprotect_ro_done = true;
        sync_global_to_guest(vm, mprotect_ro_done);
Yan Zhao Feb. 27, 2025, 5:45 a.m. UTC | #12
On Wed, Feb 26, 2025 at 11:30:15AM -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Yan Zhao wrote:
> > On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> > > On Sat, Feb 08, 2025, Yan Zhao wrote:
> > > > In the read-only mprotect() phase of mmu_stress_test, ensure that
> > > > mprotect(PROT_READ) has completed before the guest starts writing to the
> > > > read-only mprotect() memory.
> > > > 
> > > > Without waiting for mprotect_ro_done before the guest starts writing in
> > > > stage 3 (the stage for read-only mprotect()), the host's assertion of stage
> > > > 3 could fail if mprotect_ro_done is set to true in the window between the
> > > > guest finishing writes to all GPAs and executing GUEST_SYNC(3).
> > > > 
> > > > This scenario is easy to occur especially when there are hundred of vCPUs.
> > > > 
> > > > CPU 0                  CPU 1 guest     CPU 1 host
> > > >                                        enter stage 3's 1st loop
> > > >                        //in stage 3
> > > >                        write all GPAs
> > > >                        @rip 0x4025f0
> > > > 
> > > > mprotect(PROT_READ)
> > > > mprotect_ro_done=true
> > > >                        GUEST_SYNC(3)
> > > >                                        r=0, continue stage 3's 1st loop
> > > > 
> > > >                        //in stage 4
> > > >                        write GPA
> > > >                        @rip 0x402635
> > > > 
> > > >                                        -EFAULT, jump out stage 3's 1st loop
> > > >                                        enter stage 3's 2nd loop
> > > >                        write GPA
> > > >                        @rip 0x402635
> > > >                                        -EFAULT, continue stage 3's 2nd loop
> > > >                                        guest rip += 3
> > > > 
> > > > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > > > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > > > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > > > of length 4.
> > > 
> > > This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> > > vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> > > defined?
> > stage 3 is hand-coded "mov %rax, (%rax)", but stage 4 is with
> > vcpu_arch_put_guest().
> > 
> > The original code expects that "mov %rax, (%rax)" in stage 3 can produce
> > -EFAULT, so that in the host thread can jump out of stage 3's 1st vcpu_run()
> > loop.
> 
> Ugh, I forgot that there are two loops in stage-3.  I tried to prevent this race,
> but violated my own rule of not using arbitrary delays to avoid races.
> 
> Completely untested, but I think this should address the problem (I'll test
> later today; you already did the hard work of debugging).  The only thing I'm
> not positive is correct is making the first _vcpu_run() a one-off instead of a
> loop.
Right, making the first _vcpu_run() a one-off could produce the error below:
"Expected EFAULT on write to RO memory, got r = 0, errno = 4".

> 
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index d9c76b4c0d88..9ac1800bb770 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -18,6 +18,7 @@
>  #include "ucall_common.h"
>  
>  static bool mprotect_ro_done;
> +static bool vcpu_hit_ro_fault;
>  
>  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  {
> @@ -36,9 +37,9 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  
>         /*
>          * Write to the region while mprotect(PROT_READ) is underway.  Keep
> -        * looping until the memory is guaranteed to be read-only, otherwise
> -        * vCPUs may complete their writes and advance to the next stage
> -        * prematurely.
> +        * looping until the memory is guaranteed to be read-only and a fault
> +        * has occured, otherwise vCPUs may complete their writes and advance
> +        * to the next stage prematurely.
>          *
>          * For architectures that support skipping the faulting instruction,
>          * generate the store via inline assembly to ensure the exact length
> @@ -56,7 +57,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  #else
>                         vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
>  #endif
> -       } while (!READ_ONCE(mprotect_ro_done));
> +       } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(vcpu_hit_ro_fault));
>  
>         /*
>          * Only architectures that write the entire range can explicitly sync,
> @@ -148,12 +149,13 @@ static void *vcpu_worker(void *data)
>          * be stuck on the faulting instruction for other architectures.  Go to
>          * stage 3 without a rendezvous
>          */
> -       do {
> -               r = _vcpu_run(vcpu);
> -       } while (!r);
> +       r = _vcpu_run(vcpu);
>         TEST_ASSERT(r == -1 && errno == EFAULT,
>                     "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
>  
> +       /* Tell the vCPU it hit a RO fault. */
> +       WRITE_ONCE(vcpu_hit_ro_fault, true);
> +
>  #if defined(__x86_64__) || defined(__aarch64__)
>         /*
>          * Verify *all* writes from the guest hit EFAULT due to the VMA now
> @@ -378,7 +380,6 @@ int main(int argc, char *argv[])
>         rendezvous_with_vcpus(&time_run2, "run 2");
>  
>         mprotect(mem, slot_size, PROT_READ);
> -       usleep(10);
>         mprotect_ro_done = true;
>         sync_global_to_guest(vm, mprotect_ro_done);
> 
>
Sean Christopherson Feb. 27, 2025, 10:18 p.m. UTC | #13
On Thu, Feb 27, 2025, Yan Zhao wrote:
> On Wed, Feb 26, 2025 at 11:30:15AM -0800, Sean Christopherson wrote:
> > On Wed, Feb 26, 2025, Yan Zhao wrote:
> > > On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> > > > On Sat, Feb 08, 2025, Yan Zhao wrote:
> > > > > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > > > > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > > > > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > > > > of length 4.
> > > > 
> > > > This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> > > > vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> > > > defined?
> > > stage 3 is hand-coded "mov %rax, (%rax)", but stage 4 is with
> > > vcpu_arch_put_guest().
> > > 
> > > The original code expects that "mov %rax, (%rax)" in stage 3 can produce
> > > -EFAULT, so that in the host thread can jump out of stage 3's 1st vcpu_run()
> > > loop.
> > 
> > Ugh, I forgot that there are two loops in stage-3.  I tried to prevent this race,
> > but violated my own rule of not using arbitrary delays to avoid races.
> > 
> > Completely untested, but I think this should address the problem (I'll test
> > later today; you already did the hard work of debugging).  The only thing I'm
> > not positive is correct is making the first _vcpu_run() a one-off instead of a
> > loop.
> Right, making the first _vcpu_run() a one-off could produce below error:
> "Expected EFAULT on write to RO memory, got r = 0, errno = 4".

/facepalm

There are multiple vCPUs, so using a single flag isn't sufficient.  I also remembered
(well, re-discovered) why I added the weird looping on "!":

	do {                                                                    
		r = _vcpu_run(vcpu);                                            
	} while (!r);

On x86, with forced emulation, the vcpu_arch_put_guest() path hits an MMIO exit
due to a longstanding (like, forever longstanding) bug in KVM's emulator.  Given
that the vcpu_arch_put_guest() path is only reachable by disabling the x86 specific
code (which I did for testing those paths), and that the bug only manifests on x86,
I think it makes sense to drop that code as it's super confusing, gets in the way,
and is unreachable unless the user is going way out of their way to hit it.

I still haven't reproduced the failure without "help", but I was able to force
failure by doing a single write and dropping the mprotect_ro_done check:

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index a1f3f6d83134..3524dcc0dfcf 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -50,15 +50,15 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         */
        GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
        do {
-               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+               // for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
+                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(end_gpa - stride) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
                        asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
                        vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-       } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
+       } while (!READ_ONCE(all_vcpus_hit_ro_fault));
 
        /*
         * Only architectures that write the entire range can explicitly sync,

The below makes everything happy; can you verify the fix on your end?

---
 tools/testing/selftests/kvm/mmu_stress_test.c | 22 ++++++++++++-------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..a1f3f6d83134 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -18,6 +18,7 @@
 #include "ucall_common.h"
 
 static bool mprotect_ro_done;
+static bool all_vcpus_hit_ro_fault;
 
 static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 {
@@ -36,9 +37,9 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 
 	/*
 	 * Write to the region while mprotect(PROT_READ) is underway.  Keep
-	 * looping until the memory is guaranteed to be read-only, otherwise
-	 * vCPUs may complete their writes and advance to the next stage
-	 * prematurely.
+	 * looping until the memory is guaranteed to be read-only and a fault
+	 * has occured, otherwise vCPUs may complete their writes and advance
+	 * to the next stage prematurely.
 	 *
 	 * For architectures that support skipping the faulting instruction,
 	 * generate the store via inline assembly to ensure the exact length
@@ -47,6 +48,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	 * is low in this case).  For x86, hand-code the exact opcode so that
 	 * there is no room for variability in the generated instruction.
 	 */
+	GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
 	do {
 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
@@ -56,7 +58,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 #else
 			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-	} while (!READ_ONCE(mprotect_ro_done));
+	} while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
 
 	/*
 	 * Only architectures that write the entire range can explicitly sync,
@@ -81,6 +83,7 @@ struct vcpu_info {
 
 static int nr_vcpus;
 static atomic_t rendezvous;
+static atomic_t nr_ro_faults;
 
 static void rendezvous_with_boss(void)
 {
@@ -148,12 +151,16 @@ static void *vcpu_worker(void *data)
 	 * be stuck on the faulting instruction for other architectures.  Go to
 	 * stage 3 without a rendezvous
 	 */
-	do {
-		r = _vcpu_run(vcpu);
-	} while (!r);
+	r = _vcpu_run(vcpu);
 	TEST_ASSERT(r == -1 && errno == EFAULT,
 		    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
 
+	atomic_inc(&nr_ro_faults);
+	if (atomic_read(&nr_ro_faults) == nr_vcpus) {
+		WRITE_ONCE(all_vcpus_hit_ro_fault, true);
+		sync_global_to_guest(vm, all_vcpus_hit_ro_fault);
+	}
+
 #if defined(__x86_64__) || defined(__aarch64__)
 	/*
 	 * Verify *all* writes from the guest hit EFAULT due to the VMA now
@@ -378,7 +385,6 @@ int main(int argc, char *argv[])
 	rendezvous_with_vcpus(&time_run2, "run 2");
 
 	mprotect(mem, slot_size, PROT_READ);
-	usleep(10);
 	mprotect_ro_done = true;
 	sync_global_to_guest(vm, mprotect_ro_done);
 

base-commit: 557953f8b75fce49dc65f9b0f7e811c060fc7860
--
Yan Zhao Feb. 28, 2025, 10:55 a.m. UTC | #14
On Thu, Feb 27, 2025 at 02:18:02PM -0800, Sean Christopherson wrote:
> On Thu, Feb 27, 2025, Yan Zhao wrote:
> > On Wed, Feb 26, 2025 at 11:30:15AM -0800, Sean Christopherson wrote:
> > > On Wed, Feb 26, 2025, Yan Zhao wrote:
> > > > On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> > > > > On Sat, Feb 08, 2025, Yan Zhao wrote:
> > > > > > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > > > > > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > > > > > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > > > > > of length 4.
> > > > > 
> > > > > This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> > > > > vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> > > > > defined?
> > > > stage 3 is hand-coded "mov %rax, (%rax)", but stage 4 is with
> > > > vcpu_arch_put_guest().
> > > > 
> > > > The original code expects that "mov %rax, (%rax)" in stage 3 can produce
> > > > -EFAULT, so that in the host thread can jump out of stage 3's 1st vcpu_run()
> > > > loop.
> > > 
> > > Ugh, I forgot that there are two loops in stage-3.  I tried to prevent this race,
> > > but violated my own rule of not using arbitrary delays to avoid races.
> > > 
> > > Completely untested, but I think this should address the problem (I'll test
> > > later today; you already did the hard work of debugging).  The only thing I'm
> > > not positive is correct is making the first _vcpu_run() a one-off instead of a
> > > loop.
> > Right, making the first _vcpu_run() a one-off could produce below error:
> > "Expected EFAULT on write to RO memory, got r = 0, errno = 4".
> 
> /facepalm
> 
> There are multiple vCPU, using a single flag isn't sufficient.  I also remembered
> (well, re-discovered) why I added the weird looping on "!":
> 
> 	do {                                                                    
> 		r = _vcpu_run(vcpu);                                            
> 	} while (!r);
> 
> On x86, with forced emulation, the vcpu_arch_put_guest() path hits an MMIO exit
> due to a longstanding (like, forever longstanding) bug in KVM's emulator.  Given
> that the vcpu_arch_put_guest() path is only reachable by disabling the x86 specific
> code (which I did for testing those paths), and that the bug only manifests on x86,
> I think it makes sense to drop that code as it's super confusing, gets in the way,
> and is unreachable unless the user is going way out of their way to hit it.
Thanks for this background.


> I still haven't reproduced the failure without "help", but I was able to force
> failure by doing a single write and dropping the mprotect_ro_done check:
> 
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index a1f3f6d83134..3524dcc0dfcf 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -50,15 +50,15 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>          */
>         GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
>         do {
> -               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> +               // for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
>  #ifdef __x86_64__
> -                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> +                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(end_gpa - stride) : "memory"); /* mov %rax, (%rax) */
>  #elif defined(__aarch64__)
>                         asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
>  #else
>                         vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
>  #endif
> -       } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
> +       } while (!READ_ONCE(all_vcpus_hit_ro_fault));
>  
>         /*
>          * Only architectures that write the entire range can explicitly sync,
> 
> The below makes everything happy, can you verify the fix on your end?

This fix makes the issue disappear on my end. However, the issue is also not
reproducible even with just the following change...

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..e664713d2a2c 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -18,6 +18,7 @@
 #include "ucall_common.h"

 static bool mprotect_ro_done;
+static bool all_vcpus_hit_ro_fault;

 static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 {
@@ -34,6 +35,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
                *((volatile uint64_t *)gpa);
        GUEST_SYNC(2);

+       GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
        /*
         * Write to the region while mprotect(PROT_READ) is underway.  Keep
         * looping until the memory is guaranteed to be read-only, otherwise


I think it's due to the extra delay (the assert) in the guest, as I also
mentioned previously at
https://lore.kernel.org/kvm/Z6xGwnFR9cFg%2FTOL@yzhao56-desk.sh.intel.com .

If I apply your fix with the guest delay dropped, the issue re-appears.

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index a1f3f6d83134..f87fd40dbed3 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -48,7 +48,6 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         * is low in this case).  For x86, hand-code the exact opcode so that
         * there is no room for variability in the generated instruction.
         */
-       GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
        do {
                for (gpa = start_gpa; gpa < end_gpa; gpa += stride)

> ---
>  tools/testing/selftests/kvm/mmu_stress_test.c | 22 ++++++++++++-------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index d9c76b4c0d88..a1f3f6d83134 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -18,6 +18,7 @@
>  #include "ucall_common.h"
>  
>  static bool mprotect_ro_done;
> +static bool all_vcpus_hit_ro_fault;
>  
>  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  {
> @@ -36,9 +37,9 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  
>  	/*
>  	 * Write to the region while mprotect(PROT_READ) is underway.  Keep
> -	 * looping until the memory is guaranteed to be read-only, otherwise
> -	 * vCPUs may complete their writes and advance to the next stage
> -	 * prematurely.
> +	 * looping until the memory is guaranteed to be read-only and a fault
> +	 * has occured, otherwise vCPUs may complete their writes and advance
> +	 * to the next stage prematurely.
>  	 *
>  	 * For architectures that support skipping the faulting instruction,
>  	 * generate the store via inline assembly to ensure the exact length
> @@ -47,6 +48,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  	 * is low in this case).  For x86, hand-code the exact opcode so that
>  	 * there is no room for variability in the generated instruction.
>  	 */
> +	GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
>  	do {
>  		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
>  #ifdef __x86_64__
> @@ -56,7 +58,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>  #else
>  			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
>  #endif
> -	} while (!READ_ONCE(mprotect_ro_done));
> +	} while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
This doesn't look correct.

The while loop stops when
mprotect_ro_done | all_vcpus_hit_ro_fault
-----------------|----------------------
   true          |      false  ==> produces "Expected EFAULT on write to RO memory"
   true          |      true
   false         |      true   (invalid case)
   

So, I think the right one is:
-	} while (!READ_ONCE(mprotect_ro_done));
+	} while (!READ_ONCE(mprotect_ro_done) || !READ_ONCE(all_vcpus_hit_ro_fault));

Then the while loop stops only when
mprotect_ro_done | all_vcpus_hit_ro_fault
-----------------|----------------------
  true           |      true

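Just to illustrate the condition, here is a tiny standalone program (purely
hypothetical, not part of the selftest) that walks the truth table above;
the loop only exits once both flags have been observed:

#include <stdbool.h>
#include <stdio.h>

static bool mprotect_ro_done;
static bool all_vcpus_hit_ro_fault;

/*
 * Continue-condition from the loop above: keep writing while either flag
 * is still false, i.e. exit only once both are true.
 */
static bool keep_writing(void)
{
	return !mprotect_ro_done || !all_vcpus_hit_ro_fault;
}

int main(void)
{
	int i;

	for (i = 0; i < 4; i++) {
		mprotect_ro_done = i & 1;
		all_vcpus_hit_ro_fault = i & 2;
		printf("mprotect_ro_done=%d all_vcpus_hit_ro_fault=%d -> %s\n",
		       mprotect_ro_done, all_vcpus_hit_ro_fault,
		       keep_writing() ? "keep writing" : "exit loop");
	}
	return 0;
}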
>  	/*
>  	 * Only architectures that write the entire range can explicitly sync,
> @@ -81,6 +83,7 @@ struct vcpu_info {
>  
>  static int nr_vcpus;
>  static atomic_t rendezvous;
> +static atomic_t nr_ro_faults;
>  
>  static void rendezvous_with_boss(void)
>  {
> @@ -148,12 +151,16 @@ static void *vcpu_worker(void *data)
>  	 * be stuck on the faulting instruction for other architectures.  Go to
>  	 * stage 3 without a rendezvous
>  	 */
> -	do {
> -		r = _vcpu_run(vcpu);
> -	} while (!r);
> +	r = _vcpu_run(vcpu);
>  	TEST_ASSERT(r == -1 && errno == EFAULT,
>  		    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
>  
> +	atomic_inc(&nr_ro_faults);
> +	if (atomic_read(&nr_ro_faults) == nr_vcpus) {
> +		WRITE_ONCE(all_vcpus_hit_ro_fault, true);
> +		sync_global_to_guest(vm, all_vcpus_hit_ro_fault);
> +	}
> +
>  #if defined(__x86_64__) || defined(__aarch64__)
>  	/*
>  	 * Verify *all* writes from the guest hit EFAULT due to the VMA now
> @@ -378,7 +385,6 @@ int main(int argc, char *argv[])
>  	rendezvous_with_vcpus(&time_run2, "run 2");
>  
>  	mprotect(mem, slot_size, PROT_READ);
> -	usleep(10);
>  	mprotect_ro_done = true;
>  	sync_global_to_guest(vm, mprotect_ro_done);
>  
> 
> base-commit: 557953f8b75fce49dc65f9b0f7e811c060fc7860
> -- 
So, with the below change based on your fix above, the pass rate on my end is
100%. (With only the first hunk below, the pass rate is 10%).

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index a1f3f6d83134..1c65c9c3f41f 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -48,7 +48,6 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         * is low in this case).  For x86, hand-code the exact opcode so that
         * there is no room for variability in the generated instruction.
         */
-       GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
        do {
                for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
@@ -58,7 +57,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 #else
                        vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-       } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
+       } while (!READ_ONCE(mprotect_ro_done) || !READ_ONCE(all_vcpus_hit_ro_fault));

        /*
         * Only architectures that write the entire range can explicitly sync,
Sean Christopherson Feb. 28, 2025, 2 p.m. UTC | #15
On Fri, Feb 28, 2025, Yan Zhao wrote:
> On Thu, Feb 27, 2025 at 02:18:02PM -0800, Sean Christopherson wrote:
> > On Thu, Feb 27, 2025, Yan Zhao wrote:
> So, I think the right one is:
> -	} while (!READ_ONCE(mprotect_ro_done));
> +	} while (!READ_ONCE(mprotect_ro_done) || !READ_ONCE(all_vcpus_hit_ro_fault));

/double facepalm

You're 100% correct.  I did most of my testing with just the all_vcpus_hit_ro_fault
check, and then botched things when adding back mprotect_ro_done.
Yan Zhao March 3, 2025, 3:14 a.m. UTC | #16
On Fri, Feb 28, 2025 at 06:00:39AM -0800, Sean Christopherson wrote:
> On Fri, Feb 28, 2025, Yan Zhao wrote:
> > On Thu, Feb 27, 2025 at 02:18:02PM -0800, Sean Christopherson wrote:
> > > On Thu, Feb 27, 2025, Yan Zhao wrote:
> > So, I think the right one is:
> > -	} while (!READ_ONCE(mprotect_ro_done));
> > +	} while (!READ_ONCE(mprotect_ro_done) || !READ_ONCE(all_vcpus_hit_ro_fault));
> 
> /double facepalm
> 
> You're 100% correct.  I did most of my testing with just the all_vcpus_hit_ro_fault
Haha, however, I failed to write code that makes you 100% happy in one shot :)
Your version is indeed cleaner than mine!

> check, and then botched things when adding back mprotect_ro_done.
diff mbox series

Patch

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..a2ccf705bb2a 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -35,11 +35,16 @@  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	GUEST_SYNC(2);
 
 	/*
-	 * Write to the region while mprotect(PROT_READ) is underway.  Keep
-	 * looping until the memory is guaranteed to be read-only, otherwise
-	 * vCPUs may complete their writes and advance to the next stage
-	 * prematurely.
-	 *
+	 * mprotect(PROT_READ) is underway.  Keep looping until the memory is
+	 * guaranteed to be read-only, otherwise vCPUs may complete their
+	 * writes and advance to the next stage prematurely.
+	 */
+	do {
+		;
+	} while (!READ_ONCE(mprotect_ro_done));
+
+	/*
+	 * Write to the region after mprotect(PROT_READ) is done.
 	 * For architectures that support skipping the faulting instruction,
 	 * generate the store via inline assembly to ensure the exact length
 	 * of the instruction is known and stable (vcpu_arch_put_guest() on
@@ -47,16 +52,14 @@  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	 * is low in this case).  For x86, hand-code the exact opcode so that
 	 * there is no room for variability in the generated instruction.
 	 */
-	do {
-		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-			asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
+		asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
-			asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
+		asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
-			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
+		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-	} while (!READ_ONCE(mprotect_ro_done));
 
 	/*
 	 * Only architectures that write the entire range can explicitly sync,
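
For reference, this is roughly how the stage-3 section of guest_code() reads
with the patch applied (a sketch reconstructed from the hunks above; the
comment is abridged, and READ_ONCE(), GUEST_SYNC(), vcpu_arch_put_guest() and
the gpa/start_gpa/end_gpa/stride locals come from the surrounding selftest
code):

	GUEST_SYNC(2);

	/*
	 * mprotect(PROT_READ) is underway.  Keep looping until the memory is
	 * guaranteed to be read-only, otherwise vCPUs may complete their
	 * writes and advance to the next stage prematurely.
	 */
	do {
		;
	} while (!READ_ONCE(mprotect_ro_done));

	/*
	 * Write to the now read-only region; the store is hand-coded on
	 * x86/arm64 for the reasons given in the patch's comment (abridged
	 * here).
	 */
	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
#ifdef __x86_64__
		asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
#elif defined(__aarch64__)
		asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
#else
		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
#endif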