
[v4,03/16] KVM: Add KVM_CAP_MEMORY_FAULT_INFO

Message ID 20230602161921.208564-4-amoorthy@google.com (mailing list archive)
State New, archived
Series Improve scalability of KVM + userfaultfd live migration via annotated memory faults.

Commit Message

Anish Moorthy June 2, 2023, 4:19 p.m. UTC
KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information,
beyond a return value of -1 and an errno of EFAULT, when a vCPU fails an
access to guest memory that may be resolvable by userspace.

Add documentation, updates to the KVM headers, and a helper function
(kvm_populate_efault_info()) for implementing the capability.

Besides simply filling the run struct, kvm_populate_efault_info() takes
two safety measures:

  a. It tries to prevent concurrent fills on a single vCPU run struct
     by checking that the run struct being modified corresponds to the
     currently loaded vCPU.
  b. It tries to avoid filling an already-populated run struct by
     checking whether the exit reason has been modified since entry
     into KVM_RUN.
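
A condensed sketch of those two checks, taken from the helper added by
this patch (the full function appears in the diff below):

	preempt_disable();
	/* (a) only the currently loaded vCPU's run struct may be filled */
	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
		goto out;
	/* (b) don't clobber an exit that has already been set up */
	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
		goto out;
	...
out:
	preempt_enable();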

Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
even though EFAULT annotations are currently entirely absent. Picking a
point to declare the implementation "done" is difficult because

  1. Annotations will be performed incrementally in subsequent commits
     across both core and arch-specific KVM.
  2. The initial series will very likely miss some cases which need
     annotation. Although these omissions are to be fixed in the future,
     userspace still needs to expect and be able to handle
     unannotated EFAULTs.

Given these qualifications, just marking it available here seems the
least arbitrary thing to do.
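
For context, a minimal sketch of how userspace might consume the new
information (the loop structure and the resolve_fault() handler are
hypothetical, not part of this series):

	ret = ioctl(vcpu_fd, KVM_RUN, 0);
	if (ret == -1 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* Resolve [gpa, gpa + len), e.g. via UFFDIO_COPY, then re-enter. */
		resolve_fault(run->memory_fault.gpa, run->memory_fault.len,
			      run->memory_fault.flags);
	}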

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst             | 42 ++++++++++++++++++++++
 arch/arm64/kvm/arm.c                       |  1 +
 arch/x86/kvm/x86.c                         |  1 +
 include/linux/kvm_host.h                   |  9 +++++
 include/uapi/linux/kvm.h                   | 13 +++++++
 tools/include/uapi/linux/kvm.h             |  7 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c |  1 +
 virt/kvm/kvm_main.c                        | 35 ++++++++++++++++++
 8 files changed, 109 insertions(+)

Comments

Isaku Yamahata June 3, 2023, 4:58 p.m. UTC | #1
On Fri, Jun 02, 2023 at 04:19:08PM +0000,
Anish Moorthy <amoorthy@google.com> wrote:

> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index add067793b90..5b24059143b3 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6700,6 +6700,18 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len; /* in bytes */


UPM or gmem uses size instead of len. Personally I don't have any strong
preference.  It's better to converge. (or use union to accept both?)

I checked the existing ones:
KVM_EXIT_IO: size.
KVM_EXIT_MMIO: len.
KVM_INTERNAL_ERROR_EMULATION: insn_size
struct kvm_coalesced_mmio_zone: size
struct kvm_coalesced_mmio: len
struct kvm_ioeventfd: len
struct kvm_enc_region: size
struct kvm_sev_*: len
struct kvm_memory_attributes: size
struct kvm_create_guest_memfd: size


> +		} memory_fault;
> +
> +Indicates a vCPU memory fault on the guest physical address range
> +[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
> +
>  ::
>  
>      /* KVM_EXIT_NOTIFY */
Anish Moorthy June 5, 2023, 4:37 p.m. UTC | #2
On Sat, Jun 3, 2023 at 9:58 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
>
> UPM or gmem uses size instead of len. Personally I don't have any strong
> preference.  It's better to converge. (or use union to accept both?)

I like "len" because to me it implies a contiguous range, whereas
"size" does not: but it's a minor thing. Converging does seem good
though.
Anish Moorthy June 5, 2023, 5:46 p.m. UTC | #3
By the way, I'd like to solicit opinions on the checks that
kvm_populate_efault_info() is performing: specifically the
exit-reason-unset check.

On Fri, Jun 2, 2023 at 9:19 AM Anish Moorthy <amoorthy@google.com> wrote:
>
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +                                    uint64_t gpa, uint64_t len, uint64_t flags)
> +{
>  ...
> +       else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> +               goto out;

What I intended this check to guard against was the first problematic
case (A) I called out in the cover letter

> The implementation strategy for KVM_CAP_MEMORY_FAULT_INFO has risks: for
> example, if there are any existing paths in KVM_RUN which cause a vCPU
> to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
> access but ignore the failure and then (3) complete the exit to
> userspace set up in (1), then the contents of the kvm_run struct written
> in (1) will be corrupted.

What I wrote was actually incorrect, as you may see: if in (1) the
exit reason != KVM_EXIT_UNKNOWN then the exit-reason-unset check will
prevent writing to the run struct. Now, if for some reason this flow
involved populating most of the run struct in (1) but only setting the
exit reason in (3) then we'd still have a problem: but it's not
feasible to anticipate everything after all :)

I also mentioned a different error case (B)

> Another example: if KVM_RUN fails a guest memory access for which the
> EFAULT is annotated but does not return the EFAULT to userspace, then
> later returns an *un*annotated EFAULT to userspace, then userspace will
> receive incorrect information.

When the second EFAULT is un-annotated the presence/absence of the
exit-reason-unset check is irrelevant: userspace will observe an
annotated EFAULT in place of an un-annotated one either way.

There's also a third interesting case (C) which I didn't mention: an
annotated EFAULT which is ignored/suppressed followed by one which is
propagated to userspace. Here the exit-reason-unset check will prevent
the second annotation from being written, so userspace sees an
annotation with bad contents. If we believe that case (A) is a weird
sequence of events that shouldn't be happening in the first place,
then case (C) seems more important to ensure correctness in. But I
don't know anything about how often (A) happens in KVM, which is why I
want others' opinions.

So, should we drop the exit-reason-unset check (and the accompanying
patch 4) and treat existing occurrences of case (A) as bugs, or should
we maintain the check at the cost of incorrect behavior in case (C)?
Or is there another option here entirely?

Sean, I remember you calling out that some of the emulated mmio code
follows the pattern in (A), but it's been a while and my memory is
fuzzy. What's your opinion here?
Sean Christopherson June 14, 2023, 2:55 p.m. UTC | #4
On Mon, Jun 05, 2023, Anish Moorthy wrote:
> On Sat, Jun 3, 2023 at 9:58 AM Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
> >
> > UPM or gmem uses size instead of len. Personally I don't have any strong
> > preference.  It's better to converge. (or use union to accept both?)
> 
> I like "len" because to me it implies a contiguous range, whereas
> "size" does not: but it's a minor thing. Converging does seem good
> though.

Eh, I don't think we need to converge the two.  "size" is far more common when
describing the properties of a file (the gmem case), whereas "length" is often
used when describing the number of bytes being accessed by a read/write.  I.e.
they're two different things, so using different words to describe them isn't a
bad thing.

Though I suspect by "UPM or gmem" Isaku really meant "struct kvm_memory_attributes".
I don't think we need to converge that one either, though I do agree that "size"
isn't the greatest name.  I vote to rename kvm_memory_attributes's "size" to either
"nr_bytes" or "len".
Sean Christopherson June 14, 2023, 5:35 p.m. UTC | #5
On Fri, Jun 02, 2023, Anish Moorthy wrote:
> +The 'gpa' and 'len' (in bytes) fields describe the range of guest
> +physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is a
> +bitfield indicating the nature of the access: valid masks are
> +
> +  - KVM_MEMORY_FAULT_FLAG_WRITE:     The failed access was a write.
> +  - KVM_MEMORY_FAULT_FLAG_EXEC:      The failed access was an exec.

We should also define a READ flag, even though it's not strictly necessary.  That
gives userspace another way to detect "bad" data (flags should never be zero), and
it will allow us to use the same RWX bits that KVM_SET_MEMORY_ATTRIBUTES uses,
e.g. R = BIT(0), W = BIT(1), X = BIT(2) (which just so happens to be the same
RWX definitions EPT uses).
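
A sketch of the resulting flag definitions (values per the suggestion
above; the READ name is an assumption, not final):

	/* Mirror the RWX bits used by KVM_SET_MEMORY_ATTRIBUTES. */
	#define KVM_MEMORY_FAULT_FLAG_READ	(1ULL << 0)
	#define KVM_MEMORY_FAULT_FLAG_WRITE	(1ULL << 1)
	#define KVM_MEMORY_FAULT_FLAG_EXEC	(1ULL << 2)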

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0e571e973bc2..69a221f71914 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2288,4 +2288,13 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>  
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * WARNs and does nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if
> + * 'vcpu' is not the current running vcpu.
> + */
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,

Tagging a globally visible, non-static function as "inline" is odd, to say the
least.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fd80a560378c..09d4d85691e1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4674,6 +4674,9 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  
>  		return r;
>  	}
> +	case KVM_CAP_MEMORY_FAULT_INFO: {

No need for curly braces.  But that's moot because there's no need for this at
all, just let it fall through to the default handling, which KVM already does for
a number of other informational capabilities.

> +		return -EINVAL;
> +	}
>  	default:
>  		return kvm_vm_ioctl_enable_cap(kvm, cap);
>  	}
> @@ -6173,3 +6176,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>  
>  	return init_context.err;
>  }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,

I strongly prefer to avoid "populate" and "efault".  Avoid "populate" because
that verb will become stale the instance we do anything else in the helper.
Avoid "efault" because it's imprecise, i.e. this isn't to be used for just any
old -EFAULT scenario.  Something like kvm_handle_guest_uaccess_fault()? Definitely
open to other names (especially less verbose names).

> +				     uint64_t gpa, uint64_t len, uint64_t flags)
> +{
> +	if (WARN_ON_ONCE(!vcpu))
> +		return;

Drop this and instead invoke the helper if and only if vCPU is guaranteed to be
valid, e.g. in a future patch, don't add a conditional call to __kvm_write_guest_page(),
just handle the -EFAULT in kvm_vcpu_write_guest_page().  If the concern is that
callers would need to manually check "r == -EFAULT", this helper could take in the
error code, same as we do for kvm_handle_memory_failure(), e.g. do 

	if (r != -EFAULT)
		return;

here.  This is also an argument for using a less prescriptive name for the helper.
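
Putting those suggestions together, a rough sketch (the name is just a
placeholder per the naming discussion above, not final):

	int kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu, int r,
					   u64 gpa, u64 len, u64 flags)
	{
		if (r != -EFAULT)
			return r;

		vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
		vcpu->run->memory_fault.gpa = gpa;
		vcpu->run->memory_fault.len = len;
		vcpu->run->memory_fault.flags = flags;
		return -EFAULT;
	}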

> +
> +	preempt_disable();
> +	/*
> +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;

Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
abomination that I wish didn't exist, not a pattern that should be copied.  If
we do keep this sanity check, it can simply be

	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
		return;

because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
pointer won't change even if this task gets migrated to a different pCPU.  If
this code were doing something with vcpu->cpu then preemption would need to be
disabled throughout, but that's not the case.

> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if

Kernel style is to not use if-elif-elif if any of the preceding checks are terminal,
i.e. return or goto.  There's a good reason for that style/rule too, as it allows
avoiding weirdness like this where there's a big block comment in the middle of
an if-elif sequence.

> (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))

As I've stated multiple times, this can't WARN in "normal" builds because userspace
can modify kvm_run fields at will.  I do want a WARN as it will allow fuzzers to
find bugs for us, but it needs to be guarded with a Kconfig (or maybe a module
param).  One idea would be to make the proposed CONFIG_KVM_PROVE_MMU[*] a generic
Kconfig and use that.

And this should not be a terminal condition, i.e. KVM should WARN but continue on.
I am like 99% confident there are existing cases where KVM fills exit_reason
without actually exiting, i.e. bailing will immediately "break" KVM.  On the other
hand, clobbering what came before *might* break KVM, but it might work too.  More
thoughts below.

[*] https://lkml.kernel.org/r/20230511235917.639770-8-seanjc%40google.com

> +		goto out;

Folding in your other reply, as I wanted the full original context.

> What I intended this check to guard against was the first problematic
> case (A) I called out in the cover letter
>
> > The implementation strategy for KVM_CAP_MEMORY_FAULT_INFO has risks: for
> > example, if there are any existing paths in KVM_RUN which cause a vCPU
> > to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
> > access but ignore the failure and then (3) complete the exit to
> > userspace set up in (1), then the contents of the kvm_run struct written
> > in (1) will be corrupted.
>
> What I wrote was actually incorrect, as you may see: if in (1) the
> exit reason != KVM_EXIT_UNKNOWN then the exit-reason-unset check will
> prevent writing to the run struct. Now, if for some reason this flow
> involved populating most of the run struct in (1) but only setting the
> exit reason in (3) then we'd still have a problem: but it's not
> feasible to anticipate everything after all :)
>
> I also mentioned a different error case (B)
>
> > Another example: if KVM_RUN fails a guest memory access for which the
> > EFAULT is annotated but does not return the EFAULT to userspace, then
> > later returns an *un*annotated EFAULT to userspace, then userspace will
> > receive incorrect information.
>
> When the second EFAULT is un-annotated the presence/absence of the
> exit-reason-unset check is irrelevant: userspace will observe an
> annotated EFAULT in place of an un-annotated one either way.
>
> There's also a third interesting case (C) which I didn't mention: an
> annotated EFAULT which is ignored/suppressed followed by one which is
> propagated to userspace. Here the exit-reason-unset check will prevent
> the second annotation from being written, so userspace sees an
> annotation with bad contents. If we believe that case (A) is a weird
> sequence of events that shouldn't be happening in the first place,
> then case (C) seems more important to ensure correctness in. But I
> don't know anything about how often (A) happens in KVM, which is why I
> want others' opinions.
>
> So, should we drop the exit-reason-unset check (and the accompanying
> patch 4) and treat existing occurrences of case (A) as bugs, or should
> we maintain the check at the cost of incorrect behavior in case (C)?
> Or is there another option here entirely?
>
> Sean, I remember you calling out that some of the emulated mmio code
> follows the pattern in (A), but it's been a while and my memory is
> fuzzy. What's your opinion here?

I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
think this what you intended for each case?

  (A) if there are any existing paths in KVM_RUN which cause a vCPU
      to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
      access but ignore the failure and then (3) complete the exit to
      userspace set up in (1), then the contents of the kvm_run struct written
      in (1) will be corrupted.

  (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
      but does not return the EFAULT to userspace, then later returns an *un*annotated
      EFAULT to userspace, then userspace will receive incorrect information.

  (C) an annotated EFAULT which is ignored/suppressed followed by one which is
      propagated to userspace. Here the exit-reason-unset check will prevent the
      second annotation from being written, so userspace sees an annotation with
>       bad contents. If we believe that case (A) is a weird sequence of events
      that shouldn't be happening in the first place, then case (C) seems more
      important to ensure correctness in. But I don't know anything about how often
      (A) happens in KVM, which is why I want others' opinions.

(A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
side effect of deficiencies in KVM's uAPI.

(B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
versions of this series, userspace needs to guard against a vCPU getting stuck in
an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
address this in the kernel.  KVM's documentation should strongly encourage userspace
to take action if KVM repeatedly exits with the same info over and over, but beyond
that I think anything else is nice to have, not mandatory.
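
(A hedged userspace-side sketch of that guidance; the threshold and
bookkeeping names are made up for illustration:)

	/* Userspace: treat an identical, repeating fault as a stuck vCPU. */
	if (run->memory_fault.gpa == last_gpa &&
	    run->memory_fault.flags == last_flags) {
		if (++repeats > MAX_SAME_FAULT_RETRIES)
			bail_out_vcpu();
	} else {
		repeats = 0;
		last_gpa = run->memory_fault.gpa;
		last_flags = run->memory_fault.flags;
	}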

(C) should simply not be possible.  (A) is very much a "this shouldn't happen,
but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
yes, prioritizing correctness for (C) is far more important.

That said, prioritizing (C) doesn't mean we can't also do our best to play nice
with (A).  None of the existing exits use anywhere near the exit info union's 256
bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
in with all the others, what if we split the union in two, and place memory_fault
in the high half (doesn't have to literally be half, but you get the idea).  It'd
kinda be similar to x86's contributory vs. benign faults; exits that can't be
"nested" or "speculative" go in the low half, and things like memory_fault go in
the high half.

That way, if (A) does occur, the original information will be preserved when KVM
fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
scenarios to just fields in the second union, i.e. just memory_fault for now.
At the very least, not clobbering would likely make it easier for us to debug when
things go sideways.

And rather than use kvm_run.exit_reason as the canary, we should carve out a
kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
canary in common KVM code, which can't be done for kvm_run.exit_reason because
some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
in KVM_RUN.

E.g. something like this (the #ifdefs are heinous; it might be better to let
userspace see the exit_canary, but make it abundantly clear that it's not ABI).

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 143abb334f56..233702124e0a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -511,16 +511,43 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID     (1 << 0)
                        __u32 flags;
                } notify;
+               /* Fix the size of the union. */
+               char padding[128];
+       };
+
+       /*
+        * This second KVM_EXIT_* union holds structures for exits that may be
+        * triggered after KVM has already initiated a different exit, and/or
+        * may be filled speculatively by KVM.  E.g. because of limitations in
+        * KVM's uAPI, a memory fault can be encountered after an MMIO exit is
+        * initiated and kvm_run.mmio is filled.  Isolating these structures
+        * from the primary KVM_EXIT_* union ensures that KVM won't clobber
+        * information for the original exit.
+        */
+       union {
                /* KVM_EXIT_MEMORY_FAULT */
                struct {
                        __u64 flags;
                        __u64 gpa;
                        __u64 len;
                } memory_fault;
-               /* Fix the size of the union. */
-               char padding[256];
+               /* Fix the size of this union too. */
+#ifndef __KERNEL__
+               char padding2[128];
+#else
+               char padding2[120];
+#endif
        };
 
+#ifdef __KERNEL__
+       /*
+        * Non-ABI, kernel-only field that KVM uses to detect bugs related to
+        * filling exit_reason and the exit unions, e.g. to guard against
+        * clobbering a previous exit.
+        */
+       __u64 exit_canary;
+#endif
+
        /* 2048 is the size of the char array used to bound/pad the size
         * of the union that holds sync regs.
         */
Anish Moorthy June 20, 2023, 9:13 p.m. UTC | #6
Thanks for all the review Sean: I'm busy with other work at the
moment, so I can't address this all atm. But I should have a chance to
take the feedback and send up a new version before too long :)
Kautuk Consul July 5, 2023, 8:21 a.m. UTC | #7
Hi,

> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> +				     uint64_t gpa, uint64_t len, uint64_t flags)
> +{
> +	if (WARN_ON_ONCE(!vcpu))
> +		return;
> +
> +	preempt_disable();
> +	/*
> +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> +	 * would open the door for races between concurrent calls to this
> +	 * function.
> +	 */
> +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> +		goto out;
Sorry, I also wrote to the v3 discussion for this patch.
Re-iterating what I said there:
Why use WARN_ON_ONCE when there is a clear possibility of preemption
kicking in (with the possibility of vcpu_load/vcpu_put being called
in the new task) before preempt_disable() is called in this function?
I think you should use WARN_ON_ONCE only where there is some impossible
or unhandled situation happening, not when there is a possibility of that
situation clearly happening as per the kernel code. I think that this WARN_ON_ONCE
could make sense if kvm_populate_efault_info() is called from atomic context,
but not when you are disabling preemption from this function itself.
Basically I don't think there is any way we can guarantee that
preemption DOESN'T kick in before the preempt_disable() such that
this if-check is actually something that deserves to have a kernel
WARN_ON_ONCE() warning.
Can we get rid of this WARN_ON_ONCE and straightaway jump to the
out label if "(vcpu != __this_cpu_read(kvm_running_vcpu))" is true, or
please do correct me if I am wrong about something?
> +	/*
> +	 * Try not to overwrite an already-populated run struct.
> +	 * This isn't a perfect solution, as there's no guarantee that the exit
> +	 * reason is set before the run struct is populated, but it should prevent
> +	 * at least some bugs.
> +	 */
> +	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> +		goto out;
> +
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.len = len;
> +	vcpu->run->memory_fault.flags = flags;
> +
> +out:
> +	preempt_enable();
> +}
> -- 
> 2.41.0.rc0.172.g3f132b7071-goog
>
Kautuk Consul July 7, 2023, 11:50 a.m. UTC | #8
Hi,
> 
> > +
> > +	preempt_disable();
> > +	/*
> > +	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> > +	 * would open the door for races between concurrent calls to this
> > +	 * function.
> > +	 */
> > +	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > +		goto out;
> 
> Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> abomination that I wish didn't exist, not a pattern that should be copied.  If
> we do keep this sanity check, it can simply be
> 
> 	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> 		return;
> 
> because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> pointer won't change even if this task gets migrated to a different pCPU.  If
> this code were doing something with vcpu->cpu then preemption would need to be
> disabled throughout, but that's not the case.
> 
I think that this check is needed but without the WARN_ON_ONCE as per my 
other comment.
Reason is that we really need to insulate the code against preemption
kicking in before the call to preempt_disable() as the logic seems to
need this check to avoid concurrency problems.
(I don't think Anish simply copied this if-check from mark_page_dirty_in_slot)
Anish Moorthy July 10, 2023, 3 p.m. UTC | #9
> > > +   preempt_disable();
> > > +   /*
> > > +    * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> > > +    * would open the door for races between concurrent calls to this
> > > +    * function.
> > > +    */
> > > +   if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > > +           goto out;
> >
> > Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> > abomination that I wish didn't exist, not a pattern that should be copied.  If
> > we do keep this sanity check, it can simply be
> >
> >       if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> >               return;
> >
> > because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> > pointer won't change even if this task gets migrated to a different pCPU.  If
> > this code were doing something with vcpu->cpu then preemption would need to be
> > disabled throughout, but that's not the case.
> >
> I think that this check is needed but without the WARN_ON_ONCE as per my
> other comment.
> Reason is that we really need to insulate the code against preemption
> kicking in before the call to preempt_disable() as the logic seems to
> need this check to avoid concurrency problems.
> (I don't think Anish simply copied this if-check from mark_page_dirty_in_slot)

Heh, you're giving me too much credit here. I did copy-paste this
check, not from mark_page_dirty_in_slot() but from one of Sean's
old responses [1]

> That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> without the target vCPU being loaded:
>
>     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
>     {
>         preempt_disable();
>         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
>             goto out;
>
>         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>         ...
>     out:
>         preempt_enable();
>         return -EFAULT;
>     }

Ancient history aside, let's figure out what's really needed here.

> Why use WARN_ON_ONCE when there is a clear possibility of preemption
> kicking in (with the possibility of vcpu_load/vcpu_put being called
> in the new task) before preempt_disable() is called in this function?
> I think you should use WARN_ON_ONCE only where there is some impossible
> or unhandled situation happening, not when there is a possibility of that
> situation clearly happening as per the kernel code.

I did some mucking around to try and understand the kvm_running_vcpu
variable, and I don't think preemption/rescheduling actually trips the
WARN here? From my (limited) understanding, it seems that the
thread being preempted will cause a vcpu_put() via kvm_sched_out().
But when the thread is eventually scheduled back in onto whatever
core, it'll vcpu_load() via kvm_sched_in(), and the docstring for
kvm_get_running_vcpu() seems to imply the thing that vcpu_load()
stores into the per-cpu "kvm_running_vcpu" variable will be the same
thing which would have been observed before preemption.

All that's to say: I wouldn't expect the value of
"__this_cpu_read(kvm_running_vcpu)" to change in any given thread. If
that's true, then the things I would expect this WARN to catch are (a)
bugs where somehow the thread gets scheduled without calling
vcpu_load() or (b) bizarre situations (probably all bugs?) where some
vcpu thread has a hold of some _other_ kvm_vcpu* and is trying to do
something with it.
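
(For reference, a simplified sketch of the preempt-notifier flow
described above, condensed from virt/kvm/kvm_main.c; unrelated
bookkeeping elided:)

	static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
	{
		struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

		/* Republishes the *same* vCPU for this task on the new pCPU. */
		__this_cpu_write(kvm_running_vcpu, vcpu);
		kvm_arch_vcpu_load(vcpu, cpu);
	}

	static void kvm_sched_out(struct preempt_notifier *pn,
				  struct task_struct *next)
	{
		struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

		kvm_arch_vcpu_put(vcpu);
		__this_cpu_write(kvm_running_vcpu, NULL);
	}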


On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> abomination that I wish didn't exist, not a pattern that should be copied.  If
> we do keep this sanity check, it can simply be
>
>         if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
>                 return;
>
> because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> pointer won't change even if this task gets migrated to a different pCPU.  If
> this code were doing something with vcpu->cpu then preemption would need to be
> disabled throughout, but that's not the case.

Oh, looks like Sean said the same thing. Guess I probably should have
read that reply more closely first :/


[1] https://lore.kernel.org/kvm/ZBnLaidtZEM20jMp@google.com/
Kautuk Consul July 11, 2023, 3:54 a.m. UTC | #10
> > >
> > I think that this check is needed but without the WARN_ON_ONCE as per my
> > other comment.
> > Reason is that we really need to insulate the code against preemption
> > kicking in before the call to preempt_disable() as the logic seems to
> > need this check to avoid concurrency problems.
> > (I don't think Anish simply copied this if-check from mark_page_dirty_in_slot)
> 
> Heh, you're giving me too much credit here. I did copy-paste this
> check, not from mark_page_dirty_in_slot() but from one of Sean's
> old responses [1]
Oh, I see.
> 
> > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > without the target vCPU being loaded:
> >
> >     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> >     {
> >         preempt_disable();
> >         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> >             goto out;
> >
> >         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >         ...
> >     out:
> >         preempt_enable();
> >         return -EFAULT;
> >     }
> 
> Ancient history aside, let's figure out what's really needed here.
> 
> > Why use WARN_ON_ONCE when there is a clear possibility of preemption
> > kicking in (with the possibility of vcpu_load/vcpu_put being called
> > in the new task) before preempt_disable() is called in this function?
> > I think you should use WARN_ON_ONCE only where there is some impossible
> > or unhandled situation happening, not when there is a possibility of that
> > situation clearly happening as per the kernel code.
> 
> I did some mucking around to try and understand the kvm_running_vcpu
> variable, and I don't think preemption/rescheduling actually trips the
> WARN here? From my (limited) understanding, it seems that the
> thread being preempted will cause a vcpu_put() via kvm_sched_out().
> But when the thread is eventually scheduled back in onto whatever
> core, it'll vcpu_load() via kvm_sched_in(), and the docstring for
> kvm_get_running_vcpu() seems to imply the thing that vcpu_load()
> stores into the per-cpu "kvm_running_vcpu" variable will be the same
> thing which would have been observed before preemption.
> 
> All that's to say: I wouldn't expect the value of
> "__this_cpu_read(kvm_running_vcpu)" to change in any given thread. If
> that's true, then the things I would expect this WARN to catch are (a)
> bugs where somehow the thread gets scheduled without calling
> vcpu_load() or (b) bizarre situations (probably all bugs?) where some
> vcpu thread has a hold of some _other_ kvm_vcpu* and is trying to do
> something with it.
Oh I completely missed the scheduling path for KVM.
But since vcpu_put and vcpu_load are exported symbols, I wonder what'll
happen when there are calls to these functions from places other
than kvm_sched_in() and kvm_sched_out()? Just thinking out loud.
> 
> 
> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Meh, this is overkill IMO.  The check in mark_page_dirty_in_slot() is an
> > abomination that I wish didn't exist, not a pattern that should be copied.  If
> > we do keep this sanity check, it can simply be
> >
> >         if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> >                 return;
> >
> > because as the comment for kvm_get_running_vcpu() explains, the returned vCPU
> > pointer won't change even if this task gets migrated to a different pCPU.  If
> > this code were doing something with vcpu->cpu then preemption would need to be
> > disabled throughout, but that's not the case.
> 
> Oh, looks like Sean said the same thing. Guess I probably should have
> read that reply more closely first :/
I too get less time to completely read through emails and
associated code between my work assignments. :-).
> 
> 
> [1] https://lore.kernel.org/kvm/ZBnLaidtZEM20jMp@google.com/
Sean Christopherson July 11, 2023, 2:25 p.m. UTC | #11
On Tue, Jul 11, 2023, Kautuk Consul wrote:
> > > That said, I agree that there's a risk that KVM could clobber vcpu->run_run by
> > > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > > without the target vCPU being loaded:
> > >
> > >     int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> > >     {
> > >         preempt_disable();
> > >         if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> > >             goto out;
> > >
> > >         vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >         ...
> > >     out:
> > >         preempt_enable();
> > >         return -EFAULT;
> > >     }
> > 
> > Ancient history aside, let's figure out what's really needed here.
> > 
> > > Why use WARN_ON_ONCE when there is a clear possibility of preemption
> > > kicking in (with the possibility of vcpu_load/vcpu_put being called
> > > in the new task) before preempt_disable() is called in this function?
> > > I think you should use WARN_ON_ONCE only where there is some impossible
> > > or unhandled situation happening, not when there is a possibility of that
> > > situation clearly happening as per the kernel code.
> > 
> > I did some mucking around to try and understand the kvm_running_vcpu
> > variable, and I don't think preemption/rescheduling actually trips the
> > WARN here? From my (limited) understanding, it seems that the
> > thread being preempted will cause a vcpu_put() via kvm_sched_out().
> > But when the thread is eventually scheduled back in onto whatever
> > core, it'll vcpu_load() via kvm_sched_in(), and the docstring for
> > kvm_get_running_vcpu() seems to imply the thing that vcpu_load()
> > stores into the per-cpu "kvm_running_vcpu" variable will be the same
> > thing which would have been observed before preemption.
> > 
> > All that's to say: I wouldn't expect the value of
> > "__this_cpu_read(kvm_running_vcpu)" to change in any given thread. If
> > that's true, then the things I would expect this WARN to catch are (a)
> > bugs where somehow the thread gets scheduled without calling
> > vcpu_load() or (b) bizarre situations (probably all bugs?) where some
> > vcpu thread has a hold of some _other_ kvm_vcpu* and is trying to do
> > something with it.
> Oh I completely missed the scheduling path for KVM.
> But since vcpu_put and vcpu_load are exported symbols, I wonder what'll
> happen when there are calls to these functions from places other
> than kvm_sched_in() and kvm_sched_out()? Just thinking out loud.

Invoking this helper without the target vCPU loaded on the current task would be
considered a bug.  kvm.ko exports a rather disgusting number of symbols purely for
use by vendor modules, e.g. kvm-intel.ko and kvm-amd.ko on x86.  The exports are
not at all intended to be used by non-KVM code, i.e. any such misuse would also be
considered a bug.
Anish Moorthy Aug. 11, 2023, 10:12 p.m. UTC | #12
On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>
> Tagging a globally visible, non-static function as "inline" is odd, to say the
> least.

I think my eyes glaze over whenever I read the words "translation
unit" (my brain certainly does) so I'll have to take your word for it.
IIRC last time I tried to mark this function "static" the compiler
yelled at me, so removing the "inline" it is.

> > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
>
> I strongly prefer to avoid "populate" and "efault".  Avoid "populate" because
> that verb will become stale the instance we do anything else in the helper.
> Avoid "efault" because it's imprecise, i.e. this isn't to be used for just any
> old -EFAULT scenario.  Something like kvm_handle_guest_uaccess_fault()? Definitely
> open to other names (especially less verbose names).

I've taken the kvm_handle_guest_uaccess_fault() name for now, though I
remember you saying something about "uaccess" names being bad because
they'll become inaccurate once GPM rolls around? I'll circle back on
the names before sending v5 out.

> > (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
>
> As I've stated multiple times, this can't WARN in "normal" builds because userspace
> can modify kvm_run fields at will.  I do want a WARN as it will allow fuzzers to
> find bugs for us, but it needs to be guarded with a Kconfig (or maybe a module
> param).  One idea would be to make the proposed CONFIG_KVM_PROVE_MMU[*] a generic
> Kconfig and use that.

For now I've added a KVM_WARN_MEMORY_FAULT_ANNOTATE_ALREADY_POPULATED
Kconfig option; open to suggestions on the name.
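
(A sketch of how the guard might look at the call site; the option
name above is provisional:)

	/* Only WARN in opt-in builds, since userspace can scribble on kvm_run. */
	if (IS_ENABLED(CONFIG_KVM_WARN_MEMORY_FAULT_ANNOTATE_ALREADY_POPULATED))
		WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN);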

> I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
> think this what you intended for each case?
>
>   (A) if there are any existing paths in KVM_RUN which cause a vCPU
>       to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
>       access but ignore the failure and then (3) complete the exit to
>       userspace set up in (1), then the contents of the kvm_run struct written
>       in (1) will be corrupted.
>
>   (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
>       but does not return the EFAULT to userspace, then later returns an *un*annotated
>       EFAULT to userspace, then userspace will receive incorrect information.
>
>   (C) an annotated EFAULT which is ignored/suppressed followed by one which is
>       propagated to userspace. Here the exit-reason-unset check will prevent the
>       second annotation from being written, so userspace sees an annotation with
>       bad contents. If we believe that case (A) is a weird sequence of events
>       that shouldn't be happening in the first place, then case (C) seems more
>       important to ensure correctness in. But I don't know anything about how often
>       (A) happens in KVM, which is why I want others' opinions.

Yeah, I got lost in the weeds: you've gotten the important points though

> (A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
> side effect of deficiencies in KVM's uAPI.
>
> (B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
> versions of this series, userspace needs to guard against a vCPU getting stuck in
> an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
> address this in the kernel.  KVM's documentation should strongly encourage userspace
> to take action if KVM repeatedly exits with the same info over and over, but beyond
> that I think anything else is nice to have, not mandatory.
>
> (C) should simply not be possible.  (A) is very much a "this shouldn't happen,
> but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
> yes, prioritizing correctness for (C) is far more important.
>
> That said, prioritizing (C) doesn't mean we can't also do our best to play nice
> with (A).  None of the existing exits use anywhere near the exit info union's 256
> bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
> in with all the others, what if we split the union in two, and place memory_fault
> in the high half (doesn't have to literally be half, but you get the idea).  It'd
> kinda be similar to x86's contributory vs. benign faults; exits that can't be
> "nested" or "speculative" go in the low half, and things like memory_fault go in
> the high half.
>
> That way, if (A) does occur, the original information will be preserved when KVM
> fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
> scenarios to just fields in the second union, i.e. just memory_fault for now.
> At the very least, not clobbering would likely make it easier for us to debug when
> things go sideways.
>
> And rather than use kvm_run.exit_reason as the canary, we should carve out a
> kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> canary in common KVM code, which can't be done for kvm_run.exit_reason because
> some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> in KVM_RUN.

I think this is a good idea :D I was going to suggest something
similar a while back, but I thought it would be off the table. Whoops.

My one concern is that if/when other features eventually also use the
"speculative" portion, then they're going to run into the same issues
as we're trying to avoid here. But fixing *that* (probably by
propagating these exits through return values/the call stack) would be
a really big refactor, and C doesn't really have the type system for
it in the first place :(
Sean Christopherson Aug. 14, 2023, 6:01 p.m. UTC | #13
On Fri, Aug 11, 2023, Anish Moorthy wrote:
> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> >
> > Tagging a globally visible, non-static function as "inline" is odd, to say the
> > least.
> 
> I think my eyes glaze over whenever I read the words "translation
> unit" (my brain certainly does) so I'll have to take your word for it.
> IIRC last time I tried to mark this function "static" the compiler
> yelled at me, so removing the "inline" it is.

What is/was the error?  It's probably worth digging into; "static inline" should
work just fine, so there might be something funky elsewhere that you're papering
over.

> > I got a bit (ok, way more than a bit) lost in all of the (A) (B) (C) madness.  I
> > think this what you intended for each case?
> >
> >   (A) if there are any existing paths in KVM_RUN which cause a vCPU
> >       to (1) populate the kvm_run struct then (2) fail a vCPU guest memory
> >       access but ignore the failure and then (3) complete the exit to
> >       userspace set up in (1), then the contents of the kvm_run struct written
> >       in (1) will be corrupted.
> >
> >   (B) if KVM_RUN fails a guest memory access for which the EFAULT is annotated
> >       but does not return the EFAULT to userspace, then later returns an *un*annotated
> >       EFAULT to userspace, then userspace will receive incorrect information.
> >
> >   (C) an annotated EFAULT which is ignored/suppressed followed by one which is
> >       propagated to userspace. Here the exit-reason-unset check will prevent the
> >       second annotation from being written, so userspace sees an annotation with
>       bad contents. If we believe that case (A) is a weird
> >       that shouldn't be happening in the first place, then case (C) seems more
> >       important to ensure correctness in. But I don't know anything about how often
> >       (A) happens in KVM, which is why I want others' opinions.
> 
> Yeah, I got lost in the weeds: you've gotten the important points though
> 
> > (A) does sadly happen.  I wouldn't call it a "pattern" though, it's an unfortunate
> > side effect of deficiencies in KVM's uAPI.
> >
> > (B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
> > versions of this series, userspace needs to guard against a vCPU getting stuck in
> > an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
> > address this in the kernel.  KVM's documentation should strongly encourage userspace
> > to take action if KVM repeatedly exits with the same info over and over, but beyond
> > that I think anything else is nice to have, not mandatory.
> >
> > (C) should simply not be possible.  (A) is very much a "this shouldn't happen,
> > but it does" case.  KVM provides no meaningful guarantees if (A) does happen, so
> > yes, prioritizing correctness for (C) is far more important.
> >
> > That said, prioritizing (C) doesn't mean we can't also do our best to play nice
> > with (A).  None of the existing exits use anywhere near the exit info union's 256
> > bytes, i.e. there is tons of space to play with.  So rather than put memory_fault
> > in with all the others, what if we split the union in two, and place memory_fault
> > in the high half (doesn't have to literally be half, but you get the idea).  It'd
> > kinda be similar to x86's contributory vs. benign faults; exits that can't be
> > "nested" or "speculative" go in the low half, and things like memory_fault go in
> > the high half.
> >
> > That way, if (A) does occur, the original information will be preserved when KVM
> > fills memory_fault.  And my suggestion to WARN-and-continue limits the problematic
> > scenarios to just fields in the second union, i.e. just memory_fault for now.
> > At the very least, not clobbering would likely make it easier for us to debug when
> > things go sideways.
> >
> > And rather than use kvm_run.exit_reason as the canary, we should carve out a
> > kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> > canary in common KVM code, which can't be done for kvm_run.exit_reason because
> > some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> > in KVM_RUN.
> 
> I think this is a good idea :D I was going to suggest something
> similar a while back, but I thought it would be off the table- whoops.
> 
> My one concern is that if/when other features eventually also use the
> "speculative" portion, then they're going to run into the same issues
> as we're trying to avoid here.

I think it's worth the risk.  We could mitigate potential future problems to some
degree by maintaining the last N "speculative" user exits since KVM_RUN, e.g. with
a ring buffer, but (a) that's more than a bit crazy and (b) I don't think the
extra data would be actionable for userspace unless userspace somehow had a priori
knowledge of the "failing" sequence.

> But fixing *that* (probably by propagating these exits through return
> values/the call stack) would be a really big refactor, and C doesn't really
> have the type system for it in the first place :(

Yeah, lack of a clean and easy way to return a tuple makes it all but impossible
to handle this without resorting to evil shenanigans.
Anish Moorthy Aug. 15, 2023, 12:06 a.m. UTC | #14
On Mon, Aug 14, 2023 at 11:01 AM Sean Christopherson <seanjc@google.com> wrote:
>
> What is/was the error?  It's probably worth digging into; "static inline" should
> work just fine, so there might be something funky elsewhere that you're papering
> over.

What I get is

> ./include/linux/kvm_host.h:2298:20: error: function 'kvm_handle_guest_uaccess_fault' has internal linkage but is not defined [-Werror,-Wundefined-internal]
> static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
>                    ^
> arch/x86/kvm/mmu/mmu.c:3323:2: note: used here
>         kvm_handle_guest_uaccess_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
>         ^
> 1 error generated.

(mmu.c:3323 is in kvm_handle_error_pfn()). I tried shoving the
definition of the function from kvm_main.c to kvm_host.h so that I
could make it "static inline": but then the same "internal linkage"
error pops up in the kvm_vcpu_read/write_guest_page() functions.

Btw, do you actually know the size of the union in the run struct? I
started checking it but stopped when I realized that it includes
arch-dependent structs.
Sean Christopherson Aug. 15, 2023, 12:43 a.m. UTC | #15
On Mon, Aug 14, 2023, Anish Moorthy wrote:
> On Mon, Aug 14, 2023 at 11:01 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > What is/was the error?  It's probably worth digging into; "static inline" should
> > work just fine, so there might be something funky elsewhere that you're papering
> > over.
> 
> What I get is
> 
> > ./include/linux/kvm_host.h:2298:20: error: function 'kvm_handle_guest_uaccess_fault' has internal linkage but is not defined [-Werror,-Wundefined-internal]
> > static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
> >                    ^
> > arch/x86/kvm/mmu/mmu.c:3323:2: note: used here
> >         kvm_handle_guest_uaccess_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
> >         ^
> > 1 error generated.
> 
> (mmu.c:3323 is in kvm_handle_error_pfn()). I tried shoving the
> definition of the function from kvm_main.c to kvm_host.h so that I
> could make it "static inline": but then the same "internal linkage"
> error pops up in the kvm_vcpu_read/write_guest_page() functions.

Can you point me at your branch?  That should be easy to resolve, but it's all
but impossible to figure out what's going wrong without being able to see the
full code.

> Btw, do you actually know the size of the union in the run struct? I
> started checking it but stopped when I realized that it includes
> arch-dependent structs.

256 bytes, though how much of that is actually free for the "speculative" idea...

		/* Fix the size of the union. */
		char padding[256];

Well fudge.  PPC's KVM_EXIT_OSI actually uses all 256 bytes.  And KVM_EXIT_SYSTEM_EVENT
is closer to the limit than I'd like.

On the other hand, despite burning 2048 bytes for kvm_sync_regs, all of kvm_run
is only 2352 bytes, i.e. we have plenty of room in the 4KiB page.  So we could
throw the "speculative" exits in a completely different union.  But that would
be cumbersome for userspace.

Hrm.  The best option would probably be to have a "nested" or "previous" exit union,
and copy the existing exit information to that field prior to filling a new exit
reason.  But that would require an absolute insane amount of refactoring because
everything just writes the fields directly. *sigh*

I suppose we could copy the information into two places for "speculative" exits,
the actual exit union and a separate "speculative" field.  I might be grasping at
straws though, not sure that ugliness would be worth making it slightly easier to
deal with the (A) scenario from earlier.

FWIW, my trick for quickly finding the real size is to feed the size+1 into an alignment.
Unless you get really unlucky, that alignment will be illegal and the compiler
will tell you the size, e.g. 

arch/x86/kvm/x86.c:13405:9: error: requested alignment ‘2353’ is not a positive power of 2
13405 |         unsigned int len __aligned(sizeof(*run) + 1);
      |         ^~~~~~~~
Anish Moorthy Aug. 15, 2023, 5:01 p.m. UTC | #16
On Fri, Aug 11, 2023 at 3:12 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> >
> > Tagging a globally visible, non-static function as "inline" is odd, to say the
> > least.
>
> I think my eyes glaze over whenever I read the words "translation
> unit" (my brain certainly does) so I'll have to take your word for it.
> IIRC last time I tried to mark this function "static" the compiler
> yelled at me, so removing the "inline" it is.
>
>...
>
On Mon, Aug 14, 2023 at 5:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Can you point me at your branch?  That should be easy to resolve, but it's all
> but impossible to figure out what's going wrong without being able to see the
> full code.

Sure: https://github.com/anlsh/linux/tree/suffd-kvm-staticinline.
Don't worry about this unless you're bored though: I only called out
my change because I wanted to make sure the final signature was fine.
If you say it should be static inline then I can take a more concerted
stab at learning/figuring out what's going on here.

> > Btw, do you actually know the size of the union in the run struct? I
> > started checking it but stopped when I realized that it includes
> > arch-dependent structs.
>
> 256 bytes, though how much of that is actually free for the "speculative" idea...
>
>                 /* Fix the size of the union. */
>                 char padding[256];
>
> Well fudge.  PPC's KVM_EXIT_OSI actually uses all 256 bytes.  And KVM_EXIT_SYSTEM_EVENT
> is closer to the limit than I'd like
>
> On the other hand, despite burning 2048 bytes for kvm_sync_regs, all of kvm_run
> is only 2352 bytes, i.e. we have plenty of room in the 4KiB page.  So we could
> throw the "speculative" exits in a completely different union.  But that would
> be cumbersome for userspace.

Haha, well it's a good thing we checked. What about an extra union
would be cumbersome for userspace though? From an API perspective it
doesn't seem like splitting the current struct or adding an extra one
would be all too different. Is it something about needing to recompile
things due to the struct size change?
Sean Christopherson Aug. 16, 2023, 3:58 p.m. UTC | #17
On Tue, Aug 15, 2023, Anish Moorthy wrote:
> On Fri, Aug 11, 2023 at 3:12 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> > >
> > > Tagging a globally visible, non-static function as "inline" is odd, to say the
> > > least.
> >
> > I think my eyes glaze over whenever I read the words "translation
> > unit" (my brain certainly does) so I'll have to take your word for it.
> > IIRC last time I tried to mark this function "static" the compiler
> > yelled at me, so removing the "inline" it is.
> >
> >...
> >
> On Mon, Aug 14, 2023 at 5:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Can you point me at your branch?  That should be easy to resolve, but it's all
> > but impossible to figure out what's going wrong without being able to see the
> > full code.
> 
> Sure: https://github.com/anlsh/linux/tree/suffd-kvm-staticinline.
> Don't worry about this unless you're bored though: I only called out
> my change because I wanted to make sure the final signature was fine.
> If you say it should be static inline then I can take a more concerted
> stab at learning/figuring out what's going on here.

That branch builds (and looks) just fine on gcc-12 and clang-14.  Maybe you have
stale objects in your build directory?  Or maybe PEBKAC?
 
> > > Btw, do you actually know the size of the union in the run struct? I
> > > started checking it but stopped when I realized that it includes
> > > arch-dependent structs.
> >
> > 256 bytes, though how much of that is actually free for the "speculative" idea...
> >
> >                 /* Fix the size of the union. */
> >                 char padding[256];
> >
> > Well fudge.  PPC's KVM_EXIT_OSI actually uses all 256 bytes.  And KVM_EXIT_SYSTEM_EVENT
> > is closer to the limit than I'd like
> >
> > On the other hand, despite burning 2048 bytes for kvm_sync_regs, all of kvm_run
> > is only 2352 bytes, i.e. we have plenty of room in the 4KiB page.  So we could
> > throw the "speculative" exits in a completely different union.  But that would
> > be cumbersome for userspace.
> 
> Haha, well it's a good thing we checked. What about an extra union
> would be cumbersome for userspace though? From an API perspective it
> doesn't seem like splitting the current struct or adding an extra one
> would be all too different. Is it something about needing to recompile
> things due to the struct size change?

I was thinking that we couldn't have two anonymous unions, and so userspace (and
KVM) would need to do something like

	run->exit2.memory_fault.gpa

instead of 

	run->memory_fault.gpa

but the names just need to be unique, e.g. the below compiles just fine.  So unless
someone has a better idea, using a separate union for exits that might be clobbered
seems like the way to go.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5bdda75bfd10..fc3701d835d6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3289,6 +3289,9 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
                return RET_PF_RETRY;
        }
 
+       vcpu->run->memory_fault.flags = 0;
+       vcpu->run->memory_fault.gpa = fault->gfn << PAGE_SHIFT;
+       vcpu->run->memory_fault.len = PAGE_SIZE;
        return -EFAULT;
 }
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f089ab290978..1a8ccd5f949a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -531,6 +531,18 @@ struct kvm_run {
                struct kvm_sync_regs regs;
                char padding[SYNC_REGS_SIZE_BYTES];
        } s;
+
+       /* Anonymous union for exits #2. */
+       union {
+               /* KVM_EXIT_MEMORY_FAULT */
+               struct {
+                       __u64 flags;
+                       __u64 gpa;
+                       __u64 len; /* in bytes */
+               } memory_fault;
+
+               char padding2[256];
+       };
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
Anish Moorthy Aug. 16, 2023, 9:28 p.m. UTC | #18
On Wed, Aug 16, 2023 at 8:58 AM Sean Christopherson <seanjc@google.com> wrote:
>
> That branch builds (and looks) just fine on gcc-12 and clang-14.  Maybe you have
> stale objects in your build directory?  Or maybe PEBKAC?

Hmm, so it does. PEBKAC indeed...

> I was thinking that we couldn't have two anonymous unions, and so userspace (and
> KVM) would need to do something like
>
>         run->exit2.memory_fault.gpa
>
> instead of
>
>         run->memory_fault.gpa
>
> but the names just need to be unique, e.g. the below compiles just fine.  So unless
> someone has a better idea, using a separate union for exits that might be clobbered
> seems like the way to go.

Agreed. By the way, what was the reason why you wanted to avoid the
exit reason canary being ABI?

On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> And rather than use kvm_run.exit_reason as the canary, we should carve out a
> kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> canary in common KVM code, which can't be done for kvm_run.exit_reason because
> some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> in KVM_RUN.
>
> E.g. something like this (the #ifdefs are heinous, it might be better to let
> userspace see the exit_canary, but make it abundantly clear that it's not ABI).
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 143abb334f56..233702124e0a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -511,16 +511,43 @@ struct kvm_run {
> +       /*
> +        * This second KVM_EXIT_* union holds structures for exits that may be
> +        * triggered after KVM has already initiated a different exit, and/or
> +        * may be filled speculatively by KVM.  E.g. because of limitations in
> +        * KVM's uAPI, a memory fault can be encountered after an MMIO exit is
> +        * initiated and kvm_run.mmio is filled.  Isolating these structures
> +        * from the primary KVM_EXIT_* union ensures that KVM won't clobber
> +        * information for the original exit.
> +        */
> +       union {
>                 /* KVM_EXIT_MEMORY_FAULT */
>                 blahblahblah
> +#endif
>         };
>
> +#ifdef __KERNEL__
> +       /*
> +        * Non-ABI, kernel-only field that KVM uses to detect bugs related to
> +        * filling exit_reason and the exit unions, e.g. to guard against
> +        * clobbering a previous exit.
> +        */
> +       __u64 exit_canary;
> +#endif
> +

We can't set exit_reason in the kvm_handle_guest_uaccess_fault()
helper if we're to handle case A (set up vCPU exit -> fail guest
memory access -> return to userspace) correctly. But then userspace
needs some other way to check whether an efault is annotated, and
might as well check the canary, so something like

> +       __u32 speculative_exit_reason;
> +       union {
> +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> +               struct {
> +                       __u64 flags;
> +                       __u64 gpa;
> +                       __u64 len;
> +               } memory_fault;
> +               /* Fix the size of the union. */
> +               char speculative_padding[256];
> +       };

With the condition for an annotated efault being "if kvm_run returns
-EFAULT and speculative_exit_reason is..."
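
Concretely, I'd expect userspace to end up with something like the below
(rough sketch; resolve_guest_fault() is a stand-in for whatever
resolution logic userspace implements):

	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret == -1 && errno == EFAULT &&
	    run->speculative_exit_reason == KVM_SPEC_EXIT_MEMORY_FAULT) {
		/* Fresh annotation: try to resolve [gpa, gpa + len). */
		ret = resolve_guest_fault(run->memory_fault.gpa,
					  run->memory_fault.len,
					  run->memory_fault.flags);
	}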
Anish Moorthy Aug. 17, 2023, 10:55 p.m. UTC | #19
Documentation update for KVM_CAP_MEMORY_FAULT_INFO

> KVM_CAP_MEMORY_FAULT_INFO
> -------------------------------------------

Old:
> -The presence of this capability indicates that KVM_RUN may annotate EFAULTs
> -returned by KVM_RUN in response to failed vCPU guest memory accesses which
> -userspace may be able to resolve.

New:
> +The presence of this capability indicates that KVM_RUN may fill
> +kvm_run.memory_fault in response to failed guest memory accesses in a vCPU
> +context.

On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> (B) is the trickiest to defend against in the kernel, but as I mentioned in earlier
> versions of this series, userspace needs to guard against a vCPU getting stuck in
> an infinite fault anyways, so I'm not _that_ concerned with figuring out a way to
> address this in the kernel.  KVM's documentation should strongly encourage userspace
> to take action if KVM repeatedly exits with the same info over and over, but beyond
> that I think anything else is nice to have, not mandatory.

In response to that, I've added the following bit as well:

+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
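
To sketch what that guard might look like in userspace (untested;
MAX_IDENTICAL_FAULTS and resolve_guest_fault() are made up for
illustration):

	#define MAX_IDENTICAL_FAULTS	16	/* arbitrary limit */

	uint64_t last_gpa = ~0ULL;
	int repeats = 0;

	while (ioctl(vcpu_fd, KVM_RUN, 0) == -1 && errno == EFAULT &&
	       run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		if (run->memory_fault.gpa == last_gpa) {
			if (++repeats > MAX_IDENTICAL_FAULTS)
				errx(1, "vCPU stuck on gpa 0x%llx",
				     (unsigned long long)run->memory_fault.gpa);
		} else {
			last_gpa = run->memory_fault.gpa;
			repeats = 0;
		}
		resolve_guest_fault(run);
	}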
Sean Christopherson Aug. 17, 2023, 11:58 p.m. UTC | #20
On Wed, Aug 16, 2023, Anish Moorthy wrote:
> > but the names just need to be unique, e.g. the below compiles just fine.  So unless
> > someone has a better idea, using a separate union for exits that might be clobbered
> > seems like the way to go.
> 
> Agreed. By the way, what was the reason why you wanted to avoid the
> exit reason canary being ABI?

Because it doesn't need to be exposed to userspace, and it would be quite
unfortunate if we ever need/want to drop the canary, but can't because it's exposed
to userspace.

Though I have no idea why I suggested it be placed in kvm_run, the canary can simply
go in kvm_vcpu.  I'm guessing I was going for code locality, but abusing an
#ifdef to achieve that is silly.
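
E.g. (sketch only; the field name is invented and the elisions are
deliberate):

	struct kvm_vcpu {
		...
		struct kvm_run *run;
		/*
		 * Non-ABI, kernel-only canary for catching KVM bugs where a
		 * second exit would clobber a previously-filled one.  Not
		 * visible to userspace, so it can be reworked or dropped
		 * without any ABI fallout.
		 */
		u32 exit_canary;
		...
	};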

> On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > And rather than use kvm_run.exit_reason as the canary, we should carve out a
> > kernel-only, i.e. non-ABI, field from the union.  That would allow setting the
> > canary in common KVM code, which can't be done for kvm_run.exit_reason because
> > some architectures, e.g. s390 (and x86 IIRC), consume the exit_reason early on
> > in KVM_RUN.
> >
> > E.g. something like this (the #ifdefs are heinous, it might be better to let
> > userspace see the exit_canary, but make it abundantly clear that it's not ABI).
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 143abb334f56..233702124e0a 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -511,16 +511,43 @@ struct kvm_run {
> > +       /*
> > +        * This second KVM_EXIT_* union holds structures for exits that may be
> > +        * triggered after KVM has already initiated a different exit, and/or
> > +        * may be filled speculatively by KVM.  E.g. because of limitations in
> > +        * KVM's uAPI, a memory fault can be encountered after an MMIO exit is
> > +        * initiated and kvm_run.mmio is filled.  Isolating these structures
> > +        * from the primary KVM_EXIT_* union ensures that KVM won't clobber
> > +        * information for the original exit.
> > +        */
> > +       union {
> >                 /* KVM_EXIT_MEMORY_FAULT */
> >                 blahblahblah
> > +#endif
> >         };
> >
> > +#ifdef __KERNEL__
> > +       /*
> > +        * Non-ABI, kernel-only field that KVM uses to detect bugs related to
> > +        * filling exit_reason and the exit unions, e.g. to guard against
> > +        * clobbering a previous exit.
> > +        */
> > +       __u64 exit_canary;
> > +#endif
> > +
> 
> We can't set exit_reason in the kvm_handle_guest_uaccess_fault()
> helper if we're to handle case A (set up vCPU exit -> fail guest
> memory access -> return to userspace) correctly. But then userspace
> needs some other way to check whether an efault is annotated, and
> might as well check the canary, so something like
> 
> > +       __u32 speculative_exit_reason;

No need for a full 32-bit value, or even a separate field; we can use kvm_run.flags.
Ugh, but of course flags' usage is arch specific.  *sigh*

We can either define a flags2 (blech), or grab the upper byte of flags for
arch-agnostic flags (slightly less blech).
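
E.g. something like (sketch; kvm_run.flags is a __u16, and the flag
name below is made up):

	/* Bits 0-7 of kvm_run.flags stay arch-specific, e.g. KVM_RUN_X86_SMM. */
	#define KVM_RUN_GENERIC_FLAGS_SHIFT	8
	/* Bits 8-15: arch-agnostic flags. */
	#define KVM_RUN_FLAG_SPECULATIVE_EXIT	(1 << KVM_RUN_GENERIC_FLAGS_SHIFT)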

Regarding the canary, if we want to use it for WARN_ON_ONCE(), it can't be
exposed to userspace.  Either that or we need to guard the WARN in some way.

> > +       union {
> > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */

Definitely just KVM_EXIT_MEMORY_FAULT; the vast, vast majority of exits to
userspace will not be speculative in any way.

> > +               struct {
> > +                       __u64 flags;
> > +                       __u64 gpa;
> > +                       __u64 len;
> > +               } memory_fault;
> > +               /* Fix the size of the union. */
> > +               char speculative_padding[256];
> > +       };
> 
> With the condition for an annotated efault being "if kvm_run returns
> -EFAULT and speculative_exit_reason is..."
Anish Moorthy Aug. 18, 2023, 5:32 p.m. UTC | #21
On Thu, Aug 17, 2023 at 4:58 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Aug 16, 2023, Anish Moorthy wrote:
> > > but the names just need to be unique, e.g. the below compiles just fine.  So unless
> > > someone has a better idea, using a separate union for exits that might be clobbered
> > > seems like the way to go.
> >
> > Agreed. By the way, what was the reason why you wanted to avoid the
> > exit reason canary being ABI?
>
> Because it doesn't need to be exposed to userspace, and it would be quite
> unfortunate if we ever need/want to drop the canary, but can't because it's exposed
> to userspace.

> No need for a full 32-bit value, or even a separate field; we can use kvm_run.flags.
> Ugh, but of course flags' usage is arch specific.  *sigh*

Ah, I realise now you're thinking of separating the canary and
whatever userspace reads to check for an annotated memory fault. I
think that even if one variable in kvm_run did double-duty for now,
we'd always be able to switch to using another as the canary without
changing the ABI. But I'm on board with separating them anyways.

> Regarding the canary, if we want to use it for WARN_ON_ONCE(), it can't be
> exposed to userspace.  Either that or we need to guard the WARN in some way.

It's guarded behind a kconfig atm, although if we decide to drop the
userspace-visible canary then I'll drop that bit.

> > On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> > > +       __u32 speculative_exit_reason;
> ...
> We can either define a flags2 (blech), or grab the upper byte of flags for
> arch-agnostic flags (slightly less blech).

Grabbing the upper byte seems reasonable, but do you anticipate KVM
ever supporting more than eight of these speculative exits? Because if
so then it seems like it'd be less trouble to use a separate u32 or
u16 (or even u8, judging by the number of KVM exits). Not sure how
much future-proofing is appropriate here :)

>
> > > +       union {
> > > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
>
> Definitely just KVM_EXIT_MEMORY_FAULT; the vast, vast majority of exits to
> userspace will not be speculative in any way.

Speaking of future-proofing, this was me trying to anticipate future
uses of the speculative exit struct: I figured that some case might
come along where KVM_RUN returns 0 *and* the contents of the speculative
exit struct might be useful; it'd be weird to look for KVM_EXIT_*s in
two different fields.
Sean Christopherson Aug. 23, 2023, 10:20 p.m. UTC | #22
On Fri, Aug 18, 2023, Anish Moorthy wrote:
> On Thu, Aug 17, 2023 at 4:58 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Aug 16, 2023, Anish Moorthy wrote:
> > > > but the names just need to be unique, e.g. the below compiles just fine.  So unless
> > > > someone has a better idea, using a separate union for exits that might be clobbered
> > > > seems like the way to go.
> > >
> > > Agreed. By the way, what was the reason why you wanted to avoid the
> > > exit reason canary being ABI?
> >
> > Because it doesn't need to be exposed to userspace, and it would be quite
> > unfortunate if we ever need/want to drop the canary, but can't because it's exposed
> > to userspace.
> 
> > No need for a full 32-bit value, or even a separate field; we can use kvm_run.flags.
> > Ugh, but of course flags' usage is arch specific.  *sigh*
> 
> Ah, I realise now you're thinking of separating the canary and
> whatever userspace reads to check for an annotated memory fault. I
> think that even if one variable in kvm_run did double-duty for now,
> we'd always be able to switch to using another as the canary without
> changing the ABI. But I'm on board with separating them anyways.
> 
> > Regarding the canary, if we want to use it for WARN_ON_ONCE(), it can't be
> > exposed to userspace.  Either that or we need to guard the WARN in some way.
> 
> It's guarded behind a kconfig atm, although if we decide to drop the
> userspace-visible canary then I'll drop that bit.

Yeah, burning a kconfig for this is probably overkill.

> > > On Wed, Jun 14, 2023 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > +       __u32 speculative_exit_reason;
> > ...
> > We can either define a flags2 (blech), or grab the upper byte of flags for
> > arch-agnostic flags (slightly less blech).
> 
> Grabbing the upper byte seems reasonable, but do you anticipate KVM
> ever supporting more than eight of these speculative exits? Because if
> so then it seems like it'd be less trouble to use a separate u32 or
> u16 (or even u8, judging by the number of KVM exits). Not sure how
> much future-proofing is appropriate here :)

I don't anticipate anything beyond the memory fault case.  We essentially already
treat incomplete exits to userspace as KVM bugs.   MMIO is the only other case I
can think of where KVM doesn't complete an exit to userspace, but that one is
essentially getting grandfathered in because of x86's flawed MMIO handling.
Userspace memory faults also get grandfathered in because of paravirt ABIs, i.e.
KVM is effectively required to ignore some faults due to external forces.

In other words, the bar for adding a "speculative" exit to userspace is very high.

> > > > +       union {
> > > > +               /* KVM_SPEC_EXIT_MEMORY_FAULT */
> >
> > Definitely just KVM_EXIT_MEMORY_FAULT; the vast, vast majority of exits to
> > userspace will not be speculative in any way.
> 
> Speaking of future-proofing, this was me trying to anticipate future
> uses of the speculative exit struct: I figured that some case might
> come along where KVM_RUN returns 0 *and* the contents of the speculative
> exit struct might be useful; it'd be weird to look for KVM_EXIT_*s in
> two different fields.

That can be handled with a comment about the new flag, e.g.

/*
 * Set if KVM filled the memory_fault field since the start of KVM_RUN.  Note,
 * memory_fault is guaranteed to be fresh if and only if KVM_RUN returns -EFAULT.
 * For all other return values, memory_fault may be stale and should be
 * considered informational only, e.g. can be captured to aid debug, but shouldn't
 * be relied on for correctness.
 */
#define 	KVM_RUN_MEMORY_FAULT_FILLED
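
Userspace's check for a fresh annotation then becomes simply (sketch;
handle_memory_fault() is a placeholder for userspace's resolution logic):

	if (ioctl(vcpu_fd, KVM_RUN, 0) == -1 && errno == EFAULT &&
	    (run->flags & KVM_RUN_MEMORY_FAULT_FILLED)) {
		/* run->memory_fault is guaranteed to be fresh here. */
		handle_memory_fault(run);
	}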
Anish Moorthy Aug. 23, 2023, 11:38 p.m. UTC | #23
On Wed, Aug 23, 2023 at 3:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> I don't anticipate anything beyond the memory fault case.  We essentially already
> treat incomplete exits to userspace as KVM bugs.   MMIO is the only other case I
> can think of where KVM doesn't complete an exit to userspace, but that one is
> essentially getting grandfathered in because of x86's flawed MMIO handling.
> Userspace memory faults also get grandfathered in because of paravirt ABIs, i.e.
> KVM is effectively required to ignore some faults due to external forces.

Well that's good to hear. Are you sure that we don't want to add even
just a dedicated u8 to indicate the speculative exit reason though?
I'm just thinking that the different structs in speculative_exit will
be mutually exclusive, whereas flags/bitfields usually indicate
non-mutually exclusive conditions.
Sean Christopherson Aug. 24, 2023, 5:24 p.m. UTC | #24
On Wed, Aug 23, 2023, Anish Moorthy wrote:
> On Wed, Aug 23, 2023 at 3:20 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > I don't anticipate anything beyond the memory fault case.  We essentially already
> > treat incomplete exits to userspace as KVM bugs.   MMIO is the only other case I
> > can think of where KVM doesn't complete an exit to userspace, but that one is
> > essentially getting grandfathered in because of x86's flawed MMIO handling.
> > Userspace memory faults also get grandfathered in because of paravirt ABIs, i.e.
> > KVM is effectively required to ignore some faults due to external forces.
> 
> Well that's good to hear. Are you sure that we don't want to add even
> just a dedicated u8 to indicate the speculative exit reason though?

Pretty sure.

> I'm just thinking that the different structs in speculative_exit will
> be mutually exclusive,

Given that we have no idea what the next "speculative" exit might be, I don't
think it's safe to assume that the next one will be mutually exclusive with
memory_fault.  I'm essentially betting that we'll never have more than 8
"speculative" exit types, which IMO is a pretty safe bet.

> whereas flags/bitfields usually indicate non-mutually exclusive conditions.
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index add067793b90..5b24059143b3 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6700,6 +6700,18 @@  array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len; /* in bytes */
+		} memory_fault;
+
+Indicates a vCPU memory fault on the guest physical address range
+[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
+
 ::
 
     /* KVM_EXIT_NOTIFY */
@@ -7734,6 +7746,36 @@  This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86, arm64
+:Returns: -EINVAL.
+
+The presence of this capability indicates that KVM_RUN may annotate EFAULTs
+returned by KVM_RUN in response to failed vCPU guest memory accesses which
+userspace may be able to resolve.
+
+The annotation is returned via the run struct. When KVM_RUN returns an error
+with errno=EFAULT, userspace may check the exit reason: if it is
+KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the run struct's
+'memory_fault' field.
+
+This capability is informational only: attempts to KVM_ENABLE_CAP it directly
+will fail.
+
+The 'gpa' and 'len' (in bytes) fields describe the range of guest
+physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is a
+bitfield indicating the nature of the access: valid masks are
+
+  - KVM_MEMORY_FAULT_FLAG_WRITE:     The failed access was a write.
+  - KVM_MEMORY_FAULT_FLAG_EXEC:      The failed access was an exec.
+
+NOTE: The implementation of this capability is incomplete. Even with it enabled,
+userspace may receive "bare" EFAULTs (i.e. exit reason != KVM_EXIT_MEMORY_FAULT)
+from KVM_RUN for failures which may be resolvable. These should be considered
+bugs and reported to the maintainers so that annotations can be added.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..b34cf0cedffa 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -234,6 +234,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a7725d41570a..d15bacb3f634 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4497,6 +4497,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e571e973bc2..69a221f71914 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2288,4 +2288,13 @@  static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
+ * populate the memory_fault field with the given information.
+ *
+ * WARNs and does nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if
+ * 'vcpu' is not the current running vcpu.
+ */
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+				     uint64_t gpa, uint64_t len, uint64_t flags);
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 737318b1c1d9..143abb334f56 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -510,6 +511,12 @@  struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1190,6 +1197,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
 #define KVM_CAP_COUNTER_OFFSET 227
+#define KVM_CAP_MEMORY_FAULT_INFO 228
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2245,4 +2253,9 @@  struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* flags for KVM_CAP_MEMORY_FAULT_INFO */
+
+#define KVM_MEMORY_FAULT_FLAG_WRITE    (1 << 0)
+#define KVM_MEMORY_FAULT_FLAG_EXEC     (1 << 1)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 4003a166328c..5476fe169921 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -505,6 +506,12 @@  struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 298c4372fb1a..7d7e9f893fd5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1868,6 +1868,7 @@  static struct exit_reason {
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	KVM_EXIT_STRING(MEMORY_NOT_PRESENT),
 #endif
+	KVM_EXIT_STRING(MEMORY_FAULT),
 };
 
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fd80a560378c..09d4d85691e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4674,6 +4674,9 @@  static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 
 		return r;
 	}
+	case KVM_CAP_MEMORY_FAULT_INFO: {
+		return -EINVAL;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
@@ -6173,3 +6176,35 @@  int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
 
 	return init_context.err;
 }
+
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+				     uint64_t gpa, uint64_t len, uint64_t flags)
+{
+	if (WARN_ON_ONCE(!vcpu))
+		return;
+
+	preempt_disable();
+	/*
+	 * Ensure that this vCPU isn't modifying another vCPU's run struct, which
+	 * would open the door for races between concurrent calls to this
+	 * function.
+	 */
+	if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
+		goto out;
+	/*
+	 * Try not to overwrite an already-populated run struct.
+	 * This isn't a perfect solution, as there's no guarantee that the exit
+	 * reason is set before the run struct is populated, but it should prevent
+	 * at least some bugs.
+	 */
+	else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
+		goto out;
+
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.len = len;
+	vcpu->run->memory_fault.flags = flags;
+
+out:
+	preempt_enable();
+}