
[09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)

Message ID 20240809194335.1726916-10-seanjc@google.com (mailing list archive)
State New, archived
Series KVM: x86/mmu: Allow yielding on mmu_notifier zap

Commit Message

Sean Christopherson Aug. 9, 2024, 7:43 p.m. UTC
Add two phases to mmu_stress_test to verify that KVM correctly handles
guest memory that was writable, and then made read-only in the primary MMU,
and then made writable again.

Add bonus coverage for x86 to verify that all of guest memory was marked
read-only.  Making forward progress (without making memory writable)
requires arch specific code to skip over the faulting instruction, but the
test can at least verify each vCPU's starting page was made read-only.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
 1 file changed, 84 insertions(+), 3 deletions(-)

Comments

Sean Christopherson Sept. 7, 2024, 12:53 a.m. UTC | #1
On Fri, Sep 06, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Add two phases to mmu_stress_test to verify that KVM correctly handles
> > guest memory that was writable, and then made read-only in the primary MMU,
> > and then made writable again.
> >
> > Add bonus coverage for x86 to verify that all of guest memory was marked
> > read-only.  Making forward progress (without making memory writable)
> > requires arch specific code to skip over the faulting instruction, but the
> > test can at least verify each vCPU's starting page was made read-only.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Writing off-list because I just have some smaller questions that I
> don't want to bother the list with.

Pulling everyone and the lists back in :-)

IMO, no question is too small for kvm@, and lkml@ is a gigantic firehose that's 99%
archival and 1% list, at best.  Odds are very, very good that if you have a
question, however trivial or small, then someone else has the exact same question,
or _will_ have the question in the future.

I strongly prefer that all questions, review, feedback, etc. happen on list, even
if the questions/feedback may seem trivial or noisy.  The only exception is if
information can't/shouldn't be made public, e.g. because of an embargo, NDA,
security implications, etc.

> For the next version, feel free to add:
> 
> Reviewed-by: James Houghton <jthoughton@google.com>
> 
> All of the selftest patches look fine to me.
> 
> > ---
> >  tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
> >  1 file changed, 84 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > index 50c3a17418c4..98f9a4660269 100644
> > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > @@ -16,6 +16,8 @@
> >  #include "guest_modes.h"
> >  #include "processor.h"
> >
> > +static bool mprotect_ro_done;
> > +
> >  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> >  {
> >         uint64_t gpa;
> > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> >                 *((volatile uint64_t *)gpa);
> >         GUEST_SYNC(2);
> >
> > +       /*
> > +        * Write to the region while mprotect(PROT_READ) is underway.  Keep
> > +        * looping until the memory is guaranteed to be read-only, otherwise
> > +        * vCPUs may complete their writes and advance to the next stage
> > +        * prematurely.
> > +        */
> > +       do {
> > +               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> > +#ifdef __x86_64__
> > +                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
> 
> Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = gpa. :)
> 
> Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov
> BYTE PTR [rax], 0x0`. (just curious, no need to change it)

LOL, yes, but as evidenced by the trailing comment, my intent was to generate
"mov rax, [rax]", not "movb $0, [rax]".  I suspect I was too lazy to consult the
SDM to recall the correct opcode and simply copied an instruction from some random
disassembly output without looking too closely at the output.

	asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
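
For reference, the three encodings in play here, per the SDM:

  c6 40 00 00    movb $0x0, 0x0(%rax)    <- what the posted .byte sequence decodes to
  48 8b 00       mov  (%rax), %rax       <- the load the trailing comment describes
  48 89 00       mov  %rax, (%rax)       <- the 64-bit store the fixup later in this mail switches to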

> And I take it you wrote it out like this (instead of using mnemonics)
> so that you could guarantee that IP + 4 would be the right way to skip
> forwards. Does it make sense to leave a comment about that? 

Yes and yes.

> The translation from mnemonic --> bytes won't change...

Heh, this is x86, by no means is that guaranteed.  E.g. see the above, where the
same mnemonic can be represented multiple ways.

> so could you just write the proper assembly? (not a request,  just curious)

In practice, probably.  But the rules for inline assembly are, at best, fuzzy.
So long as the correct instruction is generated, the assembler has quite a bit
of freedom.

E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as:

  48 89 00

but can also be encoded as

  48 89 40 00

Now, it's _extremely_ unlikely a compiler will actually generate the latter, but
it's perfectly legal to do so.  E.g. with gcc-13, this

  mov %rax, 0x0(%rax)

generates

  48 89 00

even though a more literal interpretation would be

  48 89 40 00

So yeah, while the hand-coded opcode is gross and annoying, given that a failure
due to the "wrong" instruction being generated would be painful and time consuming
to debug, hand-coding is worth avoiding the risk and potential pain if the compiler
decides to be mean :-)
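
To make the length dependency concrete, here's a minimal sketch (not part of the
patch) showing two legal encodings of the same store; the test has to pin one of
them because the host-side RIP fixup below must advance by exactly the
instruction's length:

  asm volatile(".byte 0x48,0x89,0x00"      :: "a" (gpa) : "memory"); /* 3 bytes, pairs with rip += 3 */
  asm volatile(".byte 0x48,0x89,0x40,0x00" :: "a" (gpa) : "memory"); /* 4 bytes (disp8 = 0), would need rip += 4 */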

> A comment that 0x40 corresponds to %rax and that "a" also corresponds
> to %rax would have been helpful for me. :)

Eh, I get what you're saying, but giving a play-by-play of the encoding isn't
really all that reasonable because _so_ much information needs to be conveyed to
capture the entire picture, and some things are essentially table stakes when it
comes to x86 kernel programming.

E.g. 0x40 doesn't simply mean "(%rax)", it's a full ModR/M that defines the
addressing mode, which in turn depends on the operating mode (64-bit).

And "a" isn't just %rax; it's specifically an input register constraint, e.g. is
distinctly different than:

  asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax, (%rax) */

even though in this specific scenario they generate the same code.
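
E.g. a hand-coded load that overwrites RAX must use the read-write constraint so
the compiler knows the register's value is changed:

  asm volatile(".byte 0x48,0x8b,0x00" : "+a" (gpa) :: "memory"); /* mov (%rax), %rax */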

And with the correct "48 89 00", understanding the full encoding requires describing
REX prefixes, which are a mess unto themselves.

So, a trailing comment (with the correct mnemonic) is all I'm willing to do, even
though I 100% agree that it puts a decent sized load on the reader.  There's just
_too_ much information to communicate to the reader, at least for x86.
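
For readers who do want the rough play-by-play of the final 3-byte store, the
fields break down as follows (per the SDM's instruction format):

  48 89 00   mov %rax, (%rax)
  |  |  +--- ModR/M 0x00: mod=00 (no displacement), reg=000 (%rax, the source), r/m=000 ((%rax), the destination)
  |  +------ opcode 0x89: MOV r/m64, r64 (the store form)
  +--------- 0x48: REX.W, i.e. 64-bit operand size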

> > +#else
> > +                       vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> > +#endif
> > +       } while (!READ_ONCE(mprotect_ro_done));
> > +
> > +       /*
> > +        * Only x86 can explicitly sync, as other architectures will be stuck
> > +        * on the write fault.
> 
> It would also have been a little clearer if the comment also said how
> this is just because the test has been written to increment for the PC
> upon getting these write faults *for x86 only*. IDK something like
> "For x86, the test will adjust the PC for each write fault, allowing
> the above loop to complete. Other architectures will get stuck, so the
> #3 sync step is skipped."

Ya.  Untested, but how about this?

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 2d66c2724336..29acb22ea387 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         * looping until the memory is guaranteed to be read-only, otherwise
         * vCPUs may complete their writes and advance to the next stage
         * prematurely.
+        *
+        * For architectures that support skipping the faulting instruction,
+        * generate the store via inline assembly to ensure the exact length
+        * of the instruction is known and stable (vcpu_arch_put_guest() on
+        * fixed-length architectures should work, but the cost of paranoia
+        * is low in this case).  For x86, hand-code the exact opcode so that
+        * there is no room for variability in the generated instruction.
         */
        do {
                for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
+                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
                        asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
@@ -163,7 +170,7 @@ static void *vcpu_worker(void *data)
                TEST_ASSERT_EQ(errno, EFAULT);
 #ifdef __x86_64__
                WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
-               vcpu->run->s.regs.regs.rip += 4;
+               vcpu->run->s.regs.regs.rip += 3;
 #endif
 #ifdef __aarch64__
                vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
James Houghton Sept. 7, 2024, 2:26 a.m. UTC | #2
On Fri, Sep 6, 2024 at 5:53 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Sep 06, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Add two phases to mmu_stress_test to verify that KVM correctly handles
> > > guest memory that was writable, and then made read-only in the primary MMU,
> > > and then made writable again.
> > >
> > > Add bonus coverage for x86 to verify that all of guest memory was marked
> > > read-only.  Making forward progress (without making memory writable)
> > > requires arch specific code to skip over the faulting instruction, but the
> > > test can at least verify each vCPU's starting page was made read-only.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> >
> > Writing off-list because I just have some smaller questions that I
> > don't want to bother the list with.
>
> Pulling everyone and the lists back in :-)
>
> IMO, no question is too small for kvm@, and lkml@ is a gigantic firehose that's 99%
> archival and 1% list, at best.  Odds are very, very good that if you have a
> question, however trivial or small, then someone else has the exact same question,
> or _will_ have the question in the future.
>
> I strongly prefer that all questions, review, feedback, etc. happen on list, even
> if the questions/feedback may seem trivial or noisy.  The only exception is if
> information can't/shouldn't be made public, e.g. because of an embargo, NDA,
> security implications, etc.

I'll keep this in mind, thanks!

> > For the next version, feel free to add:
> >
> > Reviewed-by: James Houghton <jthoughton@google.com>
> >
> > All of the selftest patches look fine to me.
> >
> > > ---
> > >  tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
> > >  1 file changed, 84 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > index 50c3a17418c4..98f9a4660269 100644
> > > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > @@ -16,6 +16,8 @@
> > >  #include "guest_modes.h"
> > >  #include "processor.h"
> > >
> > > +static bool mprotect_ro_done;
> > > +
> > >  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > >  {
> > >         uint64_t gpa;
> > > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > >                 *((volatile uint64_t *)gpa);
> > >         GUEST_SYNC(2);
> > >
> > > +       /*
> > > +        * Write to the region while mprotect(PROT_READ) is underway.  Keep
> > > +        * looping until the memory is guaranteed to be read-only, otherwise
> > > +        * vCPUs may complete their writes and advance to the next stage
> > > +        * prematurely.
> > > +        */
> > > +       do {
> > > +               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> > > +#ifdef __x86_64__
> > > +                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
> >
> > Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = gpa. :)
> >
> > Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov
> > BYTE PTR [rax], 0x0`. (just curious, no need to change it)
>
> LOL, yes, but as evidenced by the trailing comment, my intent was to generate
> "mov rax, [rax]", not "movb $0, [rax]".  I suspect I was too lazy to consult the
> SDM to recall the correct opcode and simply copied an instruction from some random
> disassembly output without looking too closely at the output.
>
>         asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
>
> > And I take it you wrote it out like this (instead of using mnemonics)
> > so that you could guarantee that IP + 4 would be the right way to skip
> > forwards. Does it make sense to leave a comment about that?
>
> Yes and yes.
>
> > The translation from mnemonic --> bytes won't change...
>
> Heh, this is x86, by no means is that guaranteed.  E.g. see the above, where the
> same mnemonic can be represented multiple ways.
>
> > so could you just write the proper assembly? (not a request,  just curious)
>
> In practice, probably.  But the rules for inline assembly are, at best, fuzzy.
> So long as the correct instruction is generated, the assembler has quite a bit
> of freedom.
>
> E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as:
>
>   48 89 00
>
> but can also be encoded as
>
>   48 89 40 00
>
> Now, it's _extremely_ unlikely a compiler will actually generate the latter, but
> it's perfectly legal to do so.  E.g. with gcc-13, this
>
>   mov %rax, 0x0(%rax)
>
> generates
>
>   48 89 00
>
> even though a more literal interpretation would be
>
>   48 89 40 00

Oh... neat! I'm glad I asked about this.

>
> So yeah, while the hand-coded opcode is gross and annoying, given that a failure
> due to the "wrong" instruction being generated would be painful and time consuming
> to debug, hand-coding is worth avoiding the risk and potential pain if the compiler
> decides to be mean :-)

I 100% agree with hand-writing the opcode given what you've said. :)

> > A comment that 0x40 corresponds to %rax and that "a" also corresponds
> > to %rax would have been helpful for me. :)
>
> Eh, I get what you're saying, but giving a play-by-play of the encoding isn't
> really all that reasonable because _so_ much information needs to be conveyed to
> capture the entire picture, and some things are essentially table stakes when it
> comes to x86 kernel programming.
>
>
> E.g. 0x40 doesn't simply mean "(%rax)", it's a full ModR/M that defines the
> addressing mode, which in turn depends on the operating mode (64-bit).
>
> And "a" isn't just %rax; it's specifically an input register constraint, e.g. is
> distinctly different than:
>
>   asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax, (%rax) */
>
> even though in this specific scenario they generate the same code.
>
> And with the correct "48 89 00", understanding the full encoding requires describing
> REX prefixes, which are a mess unto themselves.
>
> So, a trailing comment (with the correct mnemonic) is all I'm willing to do, even
> though I 100% agree that it puts a decent sized load on the reader.  There's just
> _too_ much information to communicate to the reader, at least for x86.

The trailing comment works for me! Thanks for all the detail -- I am
learning so much.

>
> > > +#else
> > > +                       vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> > > +#endif
> > > +       } while (!READ_ONCE(mprotect_ro_done));
> > > +
> > > +       /*
> > > +        * Only x86 can explicitly sync, as other architectures will be stuck
> > > +        * on the write fault.
> >
> > It would also have been a little clearer if the comment also said how
> > this is just because the test has been written to increment for the PC
> > upon getting these write faults *for x86 only*. IDK something like
> > "For x86, the test will adjust the PC for each write fault, allowing
> > the above loop to complete. Other architectures will get stuck, so the
> > #3 sync step is skipped."
>
> Ya.  Untested, but how about this?

LGTM! (I haven't tested it either.)

>
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index 2d66c2724336..29acb22ea387 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>          * looping until the memory is guaranteed to be read-only, otherwise
>          * vCPUs may complete their writes and advance to the next stage
>          * prematurely.
> +        *
> +        * For architectures that support skipping the faulting instruction,
> +        * generate the store via inline assembly to ensure the exact length
> +        * of the instruction is known and stable (vcpu_arch_put_guest() on
> +        * fixed-length architectures should work, but the cost of paranoia
> +        * is low in this case).  For x86, hand-code the exact opcode so that
> +        * there is no room for variability in the generated instruction.
>          */
>         do {
>                 for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
>  #ifdef __x86_64__
> -                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
> +                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */

FWIW I much prefer the trailing comment you have ended up with vs. the
one you had before. (To me, the older one _seems_ like it's Intel
syntax, in which case the comment says it's a load..? The comment you
have now is, to me, obviously indicating a store. Though... perhaps
"movq"?)


>  #elif defined(__aarch64__)
>                         asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
>  #else
> @@ -163,7 +170,7 @@ static void *vcpu_worker(void *data)
>                 TEST_ASSERT_EQ(errno, EFAULT);
>  #ifdef __x86_64__
>                 WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
> -               vcpu->run->s.regs.regs.rip += 4;
> +               vcpu->run->s.regs.regs.rip += 3;
>  #endif
>  #ifdef __aarch64__
>                 vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
>
Sean Christopherson Sept. 9, 2024, 3:49 p.m. UTC | #3
On Fri, Sep 06, 2024, James Houghton wrote:
> On Fri, Sep 6, 2024 at 5:53 PM Sean Christopherson <seanjc@google.com> wrote:
> >  #ifdef __x86_64__
> > -                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
> > +                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> 
> FWIW I much prefer the trailing comment you have ended up with vs. the
> one you had before. (To me, the older one _seems_ like it's Intel
> syntax, in which case the comment says it's a load..? The comment you
> have now is, to me, obviously indicating a store. Though... perhaps
> "movq"?)

TL;DR: "movq" is arguably a worse mnemonic than simply "mov" because MOV *and*
        MOVQ are absurdly overloaded mnemonics, and because x86-64 is wonky.

Heh, "movq" is technically a different instruction (MMX/SSE instruction).  For
ambiguous mnemonics, the assembler infers the exact instructions from the operands.
When a register is the source or destination, appending the size to a vanilla MOV
is 100% optional, as the width of the register communicates the desired size
without any ambiguity.
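
To make that concrete, a quick illustration in AT&T syntax (the byte sequences
shown are GAS's canonical output; as discussed earlier in the thread, an
assembler is technically free to pick an equivalent longer encoding):

  mov  %rax, (%rax)    # 48 89 00       - size inferred from the 64-bit register
  movq %rax, (%rax)    # 48 89 00       - same instruction, redundant 'q' suffix
  movq %rax, %xmm0     # 66 48 0f 6e c0 - a different instruction entirely (SSE2)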

When there is no register operand, e.g. storing an immediate to memory, the size
becomes necessary, sort of.  The assembler will still happily accept an inferred
size, but the size is simply the default operand size for the current mode.

E.g.

  mov $0xffff, (%0)

will generate a 4-byte MOV

  c7 00 ff ff 00 00

so if you actually wanted a 2-byte MOV, the mnemonic needs to be:

  movw $0xffff, (%0)

There is still value in specifying an explicit operand size in assembly, as it
disambiguates the size for human readers, and also generates an error if the
operands mismatch.

E.g.

  movw $0xffff, %%eax

will fail with

  incorrect register `%eax' used with `w' suffix

The really fun one is if you want to load a 64-bit gpr with an immediate.  All
else being equal, the assembler will generally optimize for code size, and so
if the desired value can be generated by sign-extension, the compiler will opt
for the shorter opcode 0xc7 (sign-extended imm32) over 0xb8 (full imm64).

E.g.

  mov $0xffffffffffffffff, %%rax

generates

  48 c7 c0 ff ff ff ff

whereas, somewhat counter-intuitively, this

  mov $0xffffffff, %%rax

generates the more gnarly

  48 b8 ff ff ff ff 00 00 00 00


But wait, there's more!  If the developer were a wee bit smarter, they could/should
actually write

  mov $0xffffffff, %%eax

to generate

  b8 ff ff ff ff

because in x86-64, writing the lower 32 bits of a 64-bit register architecturally
clears the upper 32 bits.  I mention this because you'll actually see the compiler
take advantage of this behavior.

E.g. if you were to load RAX through an inline asm constraint

  asm volatile(".byte 0xcc" :: "a"(0xffffffff) : "memory");

the generated code will indeed be:

  b8 ff ff ff ff          mov    $0xffffffff,%eax

or if you explicitly load a register with '0'

  31 c0                   xor    %eax,%eax

Lastly, because "%0" in 64-bit mode refers to RAX, not EAX, this:

  asm volatile("mov $0xffffffff, %0" :: "a"(gpa) : "memory");

generates

  48 b8 ff ff ff ff 00 00 00 00

i.e. is equivalent to "mov .., %%rax".

Jumping back to "movq", it's perfectly fine in this case, but also fully
redundant.  And so I would prefer to document it simply as "mov", because "movq"
would be more appropriate to document something like this:

  asm volatile("movq %0, %%xmm0" :: "a"(gpa) : "memory");

  66 48 0f 6e c0          movq   %rax,%xmm0

LOL, which brings up more quirks/warts with x86-64.  Many instructions in x86,
especially SIMD instructions, have mandatory "prefixes" in order to squeeze more
instructions out of the available opcodes.  E.g. the operand size prefix, 0x66,
is reserved for MMX instructions, which allows the architecture to usurp the
reserved combination for XMM instructions.   Table 9-3. Effect of Prefixes on MMX
Instructions says this

  Operand Size (66H)    Reserved and may result in unpredictable behavior.

and specifically says "unpredictable behavior" instead of #UD, because prefixing
most MMX instructions with 0x66 "promotes" the instruction to operate on XMM
registers.

And then there's the REX prefix, which is actually four prefixes built into one.
The "base" prefix is 0x40, with the lower 4 bits encoding the four "real" prefixes.
From Table 2-4. REX Prefix Fields [BITS: 0100WRXB]

  Field Name      Bit Position    Definition
  -               7:4             0100
  W               3               0 = Operand size determined by CS.D, 1 = 64 Bit Operand Size
  R               2               Extension of the ModR/M reg field
  X               1               Extension of the SIB index field
  B               0               Extension of the ModR/M r/m field, SIB base field, or Opcode reg field

e.g. 0x48 is REX.W, 0x49 is REX.W+REX.B, etc.
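
As a quick worked example of those fields, compare the plain store from earlier
with the same store using extended registers:

  48 = 0100 1000   REX.W                  mov %rax, (%rax)  ->  48 89 00
  4d = 0100 1101   REX.W + REX.R + REX.B  mov %r8, (%r9)    ->  4d 89 01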

The first quirky thing with REX, and REX.W (0x48) + the legacy operand size
prefix (0x66) in particular, is that the legacy prefix is ignored in most cases
if REX.W=1.

  For non-byte operations: if a 66H prefix is used with prefix (REX.W = 1), 66H is ignored.

But because 0x66 is a mandatory prefix for MOVQ, it's not ignored (at least, I
don't _think_ it's ignored; objdump and gdb both seem happy decoding MOVQ
without the prefix).

Anyways, the second quirky thing with REX is that, because REX usurps single-byte
opcodes for DEC and INC

  In 64-bit mode, DEC r16 and DEC r32 are not encodable (because opcodes 48H through
  4FH are REX prefixes).
 
  In 64-bit mode, INC r16 and INC r32 are not encodable (because opcodes 40H through
  47H are REX prefixes).

i.e. uses opcodes that are actual instructions outside of 64-bit mode, the REX
prefix _must_ be the last byte before the non-prefix opcode, otherwise it's
ignored (presumably this avoids extra complexity in the instruction decoder).

  Only one REX prefix is allowed per instruction. If used, the REX prefix byte
  must immediately precede the opcode byte or the escape opcode byte (0FH). When
  a REX prefix is used in conjunction with an instruction containing a mandatory
  prefix, the mandatory prefix must come before the REX so the REX prefix can be
  immediately preceding the opcode or the escape byte. For example, CVTDQ2PD with
  a REX prefix should have REX placed between F3 and 0F E6. Other placements are
  ignored. The instruction-size limit of 15 bytes still applies to instructions
  with a REX prefix.

So even though the "opcode" for MOVQ is "66 0F 6E", when encoding with REX.W to
address RAX instead of EAX, the full encoding needs to be "66 48 0F 6E", otherwise
REX.W will be ignored, e.g. objdump will interpret it as this, even though the
CPU will decode REX.W as part of the MOVD.

  4024d1:       48                      rex.W
  4024d2:       66 0f 6e c0             movd   %eax,%xmm0

And because _that's_ just not confusing enough, there are actually _six_ distinct
opcodes for MOVQ (I think; might be more): two which are REX.W promotions of MOVD
(6E and 7E, ignoring mandatory prefixes and the escape opcode 0F), and four that
are straight quadword moves that can't target registers, i.e. can be encoded even
in 32-bit mode (6F, 7E, 7F, and D6).

So yeah, MOVQ in particular is a disaster, especially in 64-bit mode, so I'd much
prefer to just say "mov %rax, (%rax)" and leave it to the reader to understand
that it's a 64-bit store.

Patch

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 50c3a17418c4..98f9a4660269 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -16,6 +16,8 @@ 
 #include "guest_modes.h"
 #include "processor.h"
 
+static bool mprotect_ro_done;
+
 static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 {
 	uint64_t gpa;
@@ -31,6 +33,33 @@  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 		*((volatile uint64_t *)gpa);
 	GUEST_SYNC(2);
 
+	/*
+	 * Write to the region while mprotect(PROT_READ) is underway.  Keep
+	 * looping until the memory is guaranteed to be read-only, otherwise
+	 * vCPUs may complete their writes and advance to the next stage
+	 * prematurely.
+	 */
+	do {
+		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+#ifdef __x86_64__
+			asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
+#else
+			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
+#endif
+	} while (!READ_ONCE(mprotect_ro_done));
+
+	/*
+	 * Only x86 can explicitly sync, as other architectures will be stuck
+	 * on the write fault.
+	 */
+#ifdef __x86_64__
+	GUEST_SYNC(3);
+#endif
+
+	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
+	GUEST_SYNC(4);
+
 	GUEST_ASSERT(0);
 }
 
@@ -78,6 +107,7 @@  static void *vcpu_worker(void *data)
 	struct vcpu_info *info = data;
 	struct kvm_vcpu *vcpu = info->vcpu;
 	struct kvm_vm *vm = vcpu->vm;
+	int r;
 
 	vcpu_args_set(vcpu, 3, info->start_gpa, info->end_gpa, vm->page_size);
 
@@ -100,6 +130,49 @@  static void *vcpu_worker(void *data)
 
 	/* Stage 2, read all of guest memory, which is now read-only. */
 	run_vcpu(vcpu, 2);
+
+	/*
+	 * Stage 3, write guest memory and verify KVM returns -EFAULT once
+	 * the mprotect(PROT_READ) lands.  Only architectures that support
+	 * validating *all* of guest memory sync for this stage, as vCPUs will
+	 * be stuck on the faulting instruction for other architectures.  Go to
+	 * stage 3 without a rendezvous.
+	 */
+	do {
+		r = _vcpu_run(vcpu);
+	} while (!r);
+	TEST_ASSERT(r == -1 && errno == EFAULT,
+		    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
+
+#ifdef __x86_64__
+	/*
+	 * Verify *all* writes from the guest hit EFAULT due to the VMA now
+	 * being read-only.  x86-only at this time as skipping the instruction
+	 * that hits the EFAULT requires advancing the program counter, which
+	 * is arch specific and currently relies on hand-coded assembly.
+	 */
+	vcpu->run->kvm_valid_regs = KVM_SYNC_X86_REGS;
+	for (;;) {
+		r = _vcpu_run(vcpu);
+		if (!r)
+			break;
+		TEST_ASSERT_EQ(errno, EFAULT);
+		WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
+		vcpu->run->s.regs.regs.rip += 4;
+	}
+	assert_sync_stage(vcpu, 3);
+#endif
+	rendezvous_with_boss();
+
+	/*
+	 * Stage 4.  Run to completion, waiting for mprotect(PROT_WRITE) to
+	 * make the memory writable again.
+	 */
+	do {
+		r = _vcpu_run(vcpu);
+	} while (r && errno == EFAULT);
+	TEST_ASSERT_EQ(r, 0);
+	assert_sync_stage(vcpu, 4);
 	rendezvous_with_boss();
 
 	return NULL;
@@ -182,7 +255,7 @@  int main(int argc, char *argv[])
 	const uint64_t start_gpa = SZ_4G;
 	const int first_slot = 1;
 
-	struct timespec time_start, time_run1, time_reset, time_run2, time_ro;
+	struct timespec time_start, time_run1, time_reset, time_run2, time_ro, time_rw;
 	uint64_t max_gpa, gpa, slot_size, max_mem, i;
 	int max_slots, slot, opt, fd;
 	bool hugepages = false;
@@ -287,19 +360,27 @@  int main(int argc, char *argv[])
 	rendezvous_with_vcpus(&time_run2, "run 2");
 
 	mprotect(mem, slot_size, PROT_READ);
+	usleep(10);
+	mprotect_ro_done = true;
+	sync_global_to_guest(vm, mprotect_ro_done);
+
 	rendezvous_with_vcpus(&time_ro, "mprotect RO");
+	mprotect(mem, slot_size, PROT_READ | PROT_WRITE);
+	rendezvous_with_vcpus(&time_rw, "mprotect RW");
 
+	time_rw    = timespec_sub(time_rw,     time_ro);
 	time_ro    = timespec_sub(time_ro,     time_run2);
 	time_run2  = timespec_sub(time_run2,   time_reset);
 	time_reset = timespec_sub(time_reset,  time_run1);
 	time_run1  = timespec_sub(time_run1,   time_start);
 
 	pr_info("run1 = %ld.%.9lds, reset = %ld.%.9lds, run2 = %ld.%.9lds, "
-		"ro = %ld.%.9lds\n",
+		"ro = %ld.%.9lds, rw = %ld.%.9lds\n",
 		time_run1.tv_sec, time_run1.tv_nsec,
 		time_reset.tv_sec, time_reset.tv_nsec,
 		time_run2.tv_sec, time_run2.tv_nsec,
-		time_ro.tv_sec, time_ro.tv_nsec);
+		time_ro.tv_sec, time_ro.tv_nsec,
+		time_rw.tv_sec, time_rw.tv_nsec);
 
 	/*
 	 * Delete even numbered slots (arbitrary) and unmap the first half of