diff mbox series

[v3,04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN

Message ID 20230412213510.1220557-5-amoorthy@google.com (mailing list archive)
State New, archived
Headers show
Series Improve scalability of KVM + userfaultfd live migration via annotated memory faults. | expand

Commit Message

Anish Moorthy April 12, 2023, 9:34 p.m. UTC
Give kvm_run.exit_reason a defined initial value on entry into KVM_RUN:
other architectures (riscv, arm64) already use KVM_EXIT_UNKNOWN for this
purpose, so copy that convention.

This gives vCPUs trying to fill the run struct a mechanism to avoid
overwriting already-populated data, albeit an imperfect one. Being able
to detect an already-populated KVM run struct will prevent at least some
bugs in the upcoming implementation of KVM_CAP_MEMORY_FAULT_INFO, which
will attempt to fill the run struct whenever a vCPU fails a guest memory
access.

Without the already-populated check, KVM_CAP_MEMORY_FAULT_INFO could
change kvm_run in any code path that does the following:

1. Populate kvm_run for some exit and prepare to return to userspace
2. Access guest memory for some reason (but without returning -EFAULTs
    to userspace)
3. Finish the return to userspace set up in (1), now with the contents
    of kvm_run changed to contain efault info.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Anish Moorthy May 2, 2023, 5:17 p.m. UTC | #1
During some testing yesterday I realized that this patch actually
breaks the selftest, causing an error which the later selftest
changes cover up.

Running "./demand_paging_test -b 512M -u MINOR -s shmem -v 1" from
kvm/next (b3c98052d469) with just this patch applied gives the
following output:

> # ./demand_paging_test -b 512M -u MINOR -s shmem -v 1
> Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> guest physical test memory: [0x7fcdfffe000, 0x7fcffffe000)
> Finished creating vCPUs and starting uffd threads
> Started all vCPUs
> ==== Test Assertion Failure ====
>  demand_paging_test.c:50: false
>  pid=13293 tid=13297 errno=4 - Interrupted system call
>  // Some stack trace stuff
>  Invalid guest sync status: exit_reason=UNKNOWN, ucall=0

The problem is the get_ucall() part of the following block in the self
test's vcpu_worker()

> ret = _vcpu_run(vcpu);
> TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
>    TEST_ASSERT(false,
>                               "Invalid guest sync status: exit_reason=%s\n",
>                               exit_reason_str(run->exit_reason));
> }

I took a look and, while get_ucall() does depend on the value of
exit_reason, the error's root cause isn't clear to me yet.

Moving the "exit_reason = KVM_EXIT_UNKNOWN" line to later in the
function, right above the vcpu_run() call, "fixes" the problem. I've
done that for now and will bisect later to investigate: if anyone
has any clues please let me know.
Sean Christopherson May 2, 2023, 6:51 p.m. UTC | #2
On Tue, May 02, 2023, Anish Moorthy wrote:
> During some testing yesterday I realized that this patch actually
> breaks the selftest, causing an error which the later selftest
> changes cover up.
> 
> Running "./demand_paging_test -b 512M -u MINOR -s shmem -v 1" from
> kvm/next (b3c98052d469) with just this patch applied gives the
> following output:
> 
> > # ./demand_paging_test -b 512M -u MINOR -s shmem -v 1
> > Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
> > guest physical test memory: [0x7fcdfffe000, 0x7fcffffe000)
> > Finished creating vCPUs and starting uffd threads
> > Started all vCPUs
> > ==== Test Assertion Failure ====
> >  demand_paging_test.c:50: false
> >  pid=13293 tid=13297 errno=4 - Interrupted system call
> >  // Some stack trace stuff
> >  Invalid guest sync status: exit_reason=UNKNOWN, ucall=0
> 
> The problem is the get_ucall() part of the following block in the self
> test's vcpu_worker()
> 
> > ret = _vcpu_run(vcpu);
> > TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> > if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> >    TEST_ASSERT(false,
> >                               "Invalid guest sync status: exit_reason=%s\n",
> >                               exit_reason_str(run->exit_reason));
> > }
> 
> I took a look and, while get_ucall() does depend on the value of
> exit_reason, the error's root cause isn't clear to me yet.

Stating what you likely already know... On x86, the UCALL is performed via port
I/O, and so the selftests framework zeros out the ucall struct if the userspace
exit reason isn't KVM_EXIT_IO.
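
That zeroing behavior can be modeled roughly like this (a toy
stand-in with invented names, not the real selftest framework code,
which reads the ucall struct out of guest memory):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KVM_EXIT_UNKNOWN 0
#define KVM_EXIT_IO      2

struct toy_ucall { uint64_t cmd; };

/*
 * Model of the x86 selftest behavior: a ucall is recognized only when
 * the last userspace exit was a port-I/O exit; for any other
 * exit_reason the framework hands back a zeroed ucall struct.
 */
static uint64_t toy_get_ucall(uint32_t exit_reason, uint64_t pending_cmd,
			      struct toy_ucall *uc)
{
	if (exit_reason != KVM_EXIT_IO) {
		memset(uc, 0, sizeof(*uc));
		return 0;
	}
	uc->cmd = pending_cmd;
	return uc->cmd;
}
```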

> Moving the "exit_reason = KVM_EXIT_UNKNOWN" line to later in the
> function, right above the vcpu_run() call, "fixes" the problem. I've
> done that for now and will bisect later to investigate: if anyone
> has any clues please let me know.

Clobbering vcpu->run->exit_reason before this code block is a bug:

	if (unlikely(vcpu->arch.complete_userspace_io)) {
		int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
		vcpu->arch.complete_userspace_io = NULL;
		r = cui(vcpu);
		if (r <= 0)
			goto out;
	} else {
		WARN_ON_ONCE(vcpu->arch.pio.count);
		WARN_ON_ONCE(vcpu->mmio_needed);
	}

	if (kvm_run->immediate_exit) {
		r = -EINTR;
		goto out;
	}

For userspace I/O and MMIO, KVM requires userspace to "complete" the instruction
that triggered the exit to userspace, e.g. write memory/registers and skip the
instruction as needed.  The immediate_exit flag is set by userspace when userspace
wants to retain control and is doing KVM_RUN purely to placate KVM.  In selftests,
this is done by vcpu_run_complete_io().
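
The ordering problem can be sketched as a toy model of the function's
prologue (hypothetical names; the real code operates on struct
kvm_vcpu and runs the complete_userspace_io callback):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_EXIT_UNKNOWN 0
#define KVM_EXIT_IO      2

struct toy_run {
	uint32_t exit_reason;
	int immediate_exit;
};

/*
 * Toy model of kvm_arch_vcpu_ioctl_run()'s prologue with the patch
 * applied: exit_reason is clobbered before the complete-I/O /
 * immediate_exit block, so a userspace caller doing the "placate KVM"
 * KVM_RUN loses exit information it may still need.
 */
static int toy_kvm_run(struct toy_run *run)
{
	run->exit_reason = KVM_EXIT_UNKNOWN;	/* the new line, too early */

	/* ... complete_userspace_io() would run here ... */

	if (run->immediate_exit)
		return -4;	/* -EINTR: bail back to userspace */

	/* ... the actual vcpu_run() loop ... */
	return 0;
}
```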

The one part I'm a bit surprised by is that this caused ucall problems.  The ucall
framework invokes vcpu_run_complete_io() _after_ it grabs the information. 

	addr = ucall_arch_get_ucall(vcpu);
	if (addr) {
		TEST_ASSERT(addr != (void *)GUEST_UCALL_FAILED,
			    "Guest failed to allocate ucall struct");

		memcpy(uc, addr, sizeof(*uc));
		vcpu_run_complete_io(vcpu);
	} else {
		memset(uc, 0, sizeof(*uc));
	}

Making multiple calls to get_ucall() after a single guest ucall would explain
everything as only the first get_ucall() would succeed, but AFAICT the test doesn't
invoke get_ucall() multiple times.

Aha!  Found it.  _vcpu_run() invokes assert_on_unhandled_exception(), which does

	if (get_ucall(vcpu, &uc) == UCALL_UNHANDLED) {
		uint64_t vector = uc.args[0];

		TEST_FAIL("Unexpected vectored event in guest (vector:0x%lx)",
			  vector);
	}

and thus triggers vcpu_run_complete_io() before demand_paging_test's vcpu_worker()
gets control and does _its_ get_ucall().
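
Put together, the failure sequence can be modeled end to end. Again a
toy with invented names: the real get_ucall() copies the ucall struct
out of guest memory and then calls vcpu_run_complete_io(), which
re-enters KVM_RUN with immediate_exit set.

```c
#include <assert.h>
#include <stdint.h>

#define KVM_EXIT_UNKNOWN 0
#define KVM_EXIT_IO      2

#define UCALL_NONE 0
#define UCALL_SYNC 3	/* illustrative value only */

/* One pending ucall, consumed on the first successful read. */
struct toy_vcpu {
	uint32_t exit_reason;
	uint64_t pending_ucall;
};

static uint64_t toy_get_ucall(struct toy_vcpu *vcpu)
{
	if (vcpu->exit_reason != KVM_EXIT_IO)
		return UCALL_NONE;
	uint64_t cmd = vcpu->pending_ucall;
	/*
	 * Completing the I/O re-enters KVM_RUN, which -- with this
	 * patch -- resets exit_reason before immediate_exit bails out.
	 */
	vcpu->exit_reason = KVM_EXIT_UNKNOWN;
	return cmd;
}
```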
Anish Moorthy May 2, 2023, 7:49 p.m. UTC | #3
Thanks for nailing this down for me! One more question: should we be
concerned about any guest memory accesses occurring in the preamble to
that vcpu_run() call in kvm_arch_vcpu_ioctl_run()?

I only see two spots from which an EFAULT could make it to userspace,
those being the sync_regs() and cui() calls. The former looks clean
but I'm not sure about the latter. As written it's not an issue per se
if the cui() call tries a vCPU memory access: the
kvm_populate_efault_info() helper will just not populate the run
struct and WARN_ON_ONCE(). But it would be good to know about.
Sean Christopherson May 2, 2023, 8:41 p.m. UTC | #4
On Tue, May 02, 2023, Anish Moorthy wrote:
> Thanks for nailing this down for me! One more question: should we be
> concerned about any guest memory accesses occurring in the preamble to
> that vcpu_run() call in kvm_arch_vcpu_ioctl_run()?
> 
> I only see two spots from which an EFAULT could make it to userspace,
> those being the sync_regs() and cui() calls. The former looks clean

Ya, sync_regs() is a non-issue, that doesn't touch guest memory unless userspace
is doing something truly bizarre.

> but I'm not sure about the latter. As written it's not an issue per se
> if the cui() call tries a vCPU memory access: the
> kvm_populate_efault_info() helper will just not populate the run
> struct and WARN_ON_ONCE(). But it would be good to know about.

If KVM triggers a WARN_ON_ONCE(), then that's an issue.  Though looking at the
code, the cui() aspect is a moot point.  As I stated in the previous discussion,
the WARN_ON_ONCE() in question needs to be off-by-default.

 : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
 : but set kvm_run.exit_reason to some magic number, e.g. zero it out.  Then KVM could
 : WARN if something tries to overwrite kvm_run.exit_reason.  The WARN would need to
 : be buried by a Kconfig or something since kvm_run can be modified by userspace,
 : but other than that I think it would work.
Anish Moorthy May 2, 2023, 9:46 p.m. UTC | #5
On Tue, May 2, 2023 at 1:41 PM Sean Christopherson <seanjc@google.com> wrote:
>
> If KVM triggers a WARN_ON_ONCE(), then that's an issue.  Though looking at the
> code, the cui() aspect is a moot point.  As I stated in the previous discussion,
> the WARN_ON_ONCE() in question needs to be off-by-default.
>
>  : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
>  : but set kvm_run.exit_reason to some magic number, e.g. zero it out.  Then KVM could
>  : WARN if something tries to overwrite kvm_run.exit_reason.  The WARN would need to
>  : be buried by a Kconfig or something since kvm_run can be modified by userspace,
>  : but other than that I think it would work.

Ah, ok: I thought using WARN_ON_ONCE instead of WARN might have
obviated the need for the Kconfig. I'll go add one.
Sean Christopherson May 2, 2023, 10:31 p.m. UTC | #6
On Tue, May 02, 2023, Anish Moorthy wrote:
> On Tue, May 2, 2023 at 1:41 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > If KVM triggers a WARN_ON_ONCE(), then that's an issue.  Though looking at the
> > code, the cui() aspect is a moot point.  As I stated in the previous discussion,
> > the WARN_ON_ONCE() in question needs to be off-by-default.
> >
> >  : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
> >  : but set kvm_run.exit_reason to some magic number, e.g. zero it out.  Then KVM could
> >  : WARN if something tries to overwrite kvm_run.exit_reason.  The WARN would need to
> >  : be buried by a Kconfig or something since kvm_run can be modified by userspace,
> >  : but other than that I think it would work.
> 
> Ah, ok: I thought using WARN_ON_ONCE instead of WARN might have
> obviated the need for the Kconfig. I'll go add one.

Don't put too much effort into anything at this point.  I'm not entirely convinced
that it's worth carrying a Kconfig for this one-off case (my "suggestion" was mostly
just me spitballing), and at a quick glance through the rest of the series, I'll
definitely have more comments when I do a full review, i.e. things may change too.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 237c483b12301..ca73eb066af81 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10965,6 +10965,7 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	kvm_run->flags = 0;
 	kvm_load_guest_fpu(vcpu);
 
+	kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
 	kvm_vcpu_srcu_read_lock(vcpu);
 	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
 		if (kvm_run->immediate_exit) {